<a id='Top'></a>
# U.N. SDG Text Classification Task

© Explore Data AI

---
<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/38224/logos/header.png?t=2022-08-16-13-07-24" align="left">


<a id="cont"></a>

## Table of Contents

 <a id="one"></a>
## 1. INTRODUCTION
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: INTRODUCTION ⚡ |
| :--------------------------- |
| This section covers the problem statement and project objectives, and defines the features of the SDG text dataset. |

The Sustainable Development Goals (SDGs) were established in 2015 as a blueprint for peace and prosperity for people and the planet, now and into the future. Progress toward the SDGs must be monitored in order to gauge both advances and obstacles. Teams from the United Nations evaluate streams of SDG-related papers produced by governments, academia, business, and public bodies to determine how well each SDG is progressing.
Although UNEP has experts in many domains who can help evaluate these papers, connections to SDGs outside their areas of expertise may be missed.
For that reason, we, the members of Team A, have been tasked with building an NLP module that identifies the relevant SDG for each article fed to it.

### 1.1 Problem Statement
---

<a id="po"></a>

### 1.2 Project Objectives
---

* Clean the dataset so that it may be utilized for model development.

* Create a variety of models to identify the various SDGs.

* Using the provided Test Data, assess the model's accuracy in making predictions.

* Pick the best model for categorizing SDG articles.

<a id="dodf"></a>

### 1.3 Definition of Data Features
---
#### I. Data Source
The OSDG Community Dataset (OSDG-CD) is the product of hundreds of volunteers' efforts on the OSDG Community Platform (OSDG-CP) to advance our understanding of the SDGs. It contains thousands of text snippets that community volunteers labeled with regard to the SDGs.

#### II. Goal Description
1. End poverty in all its forms everywhere
2. End hunger, achieve food security and improved nutrition and promote sustainable agriculture
3. Ensure healthy lives and promote well-being for all at all ages
4. Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all
5. Achieve gender equality and empower all women and girls
6. Ensure availability and sustainable management of water and sanitation for all
7. Ensure access to affordable, reliable, sustainable and modern energy for all
8. Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all
9. Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation
10. Reduce inequality within and among countries
11. Make cities and human settlements inclusive, safe, resilient and sustainable
12. Ensure sustainable consumption and production patterns
13. Take urgent action to combat climate change and its impacts
14. Conserve and sustainably use the oceans, seas and marine resources for sustainable development
15. Protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss
16. Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels
17. Strengthen the means of implementation and revitalize the Global Partnership for Sustainable Development

#### III. Column Definitions
The OSDG-CD dataset is provided in a .csv format. It is a flat tabular dataset that contains the following columns:

* doi: Digital Object Identifier of the original document;
* text_id: unique text identifier;
* text: text excerpt from the document;
* sdg: the SDG the text is validated against;
* labels_negative: the number of volunteers who rejected the suggested SDG label;
* labels_positive: the number of volunteers who accepted the suggested SDG label;
* agreement: agreement score based on the formula described <a href="https://github.com/osdg-ai/osdg-data">here</a>.
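
As a rough illustration of the `agreement` column, the sketch below assumes the score is the absolute difference between accepting and rejecting votes divided by the total number of votes. That assumption reproduces the values in the first rows of `train.csv` shown in Section 3. The function name `agreement_score` is hypothetical, and returning 0.0 when there are no votes is a choice made for this sketch only.

```python
def agreement_score(labels_positive: int, labels_negative: int) -> float:
    """Return |positive - negative| / (positive + negative)."""
    total = labels_positive + labels_negative
    if total == 0:
        # No votes at all: the score is undefined, so this sketch returns 0.0
        return 0.0
    return abs(labels_positive - labels_negative) / total


# (labels_negative, labels_positive) pairs copied from the first five training rows
sample_votes = [(4, 5), (0, 3), (2, 7), (2, 2), (1, 4)]
scores = [round(agreement_score(pos, neg), 6) for neg, pos in sample_votes]
print(scores)  # [0.111111, 1.0, 0.555556, 0.0, 0.6]
```

A score of 1.0 means every volunteer gave the same verdict, while 0.0 means the votes were evenly split.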

 <a id="two"></a>
## 2. Import Necessary Libraries
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Import necessary libraries ⚡ |
| :--------------------------- |
| We import all the libraries the notebook needs to run. |



In [1]:
# Imports for data visualisation
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
from textblob import TextBlob
from nltk.probability import FreqDist
from wordcloud import WordCloud, ImageColorGenerator  # pip install wordcloud
import plotly.express as px
import plotly.graph_objects as go

# Imports for natural language processing
import pandas as pd
import numpy as np
import nltk
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, TreebankWordTokenizer

# Imports for model building
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Imports for model evaluation
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import plot_roc_curve  # removed in scikit-learn 1.2 (use RocCurveDisplay there)
from scikitplot.metrics import plot_roc, plot_confusion_matrix

# Other imports
import pickle
import warnings
warnings.filterwarnings("ignore")
import en_core_web_sm
import spacy



<a id="Three"></a>
## 3. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| The data from the `train` and `test` files is loaded into DataFrames in this section. |

---

In [2]:
# Import the train and test data sets
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Keep untouched copies of both data sets for EDA
train_eda = train.copy()
test_eda = test.copy()


In [3]:
train_eda.head()

Unnamed: 0,doi,text_id,text,sdg,labels_negative,labels_positive,agreement,id
0,10.18356/5950d914-en,bf7763beb5ad7a16764d1b7fa87ab018,Indicators for targets 9.b and 9.c have data a...,9,4,5,0.111111,1
1,10.18356/5950d914-en,b6415a528064b85fdde4b4c61239ed3a,Manufacturing value added as a percentage of G...,9,0,3,1.0,2
2,10.18356/31959a6d-en,29127def7e81b999b87c8e887a4fe882,To Share or Not to Share: That is the Question...,5,2,7,0.555556,3
3,10.1787/eco/surveys-cze-2014-6-en,459db322b9e44630320fda76783e0f69,"As of 2004, parents can work without losing th...",4,2,2,0.0,4
4,10.1787/9789264119536-11-en,8b7d8c6c605fe9695d08ab03d601e0e9,A question of considerable policy relevance is...,10,1,4,0.6,5


<a id=three1></a>

#### 3.1 Set Pandas to display full column contents
Because the entries in the text column are long, pandas truncates them by default. While doing EDA and data cleaning, we need to see the full text. The code below removes the column-width limit so that the complete contents of each cell are shown whenever a DataFrame is displayed.

In [4]:
# Set option to display the full contents of each column
pd.set_option('display.max_colwidth', None)

<a id=three2></a>

#### 3.2 Check the "Shape" of the data-sets
The data has been split into two sets. Their shapes show that the training set has eight columns while the test set has seven. Our model must predict the column that is absent from the test set, which we can identify by comparing the column names of the two sets. Doing so shows that the missing column is `sdg`.

In [5]:
#Checking the shape of the data sets
train.shape, test.shape

((25944, 8), (6487, 7))

In [6]:
#Checking the columns of the data set
train_eda.columns, test_eda.columns

(Index(['doi', 'text_id', 'text', 'sdg', 'labels_negative', 'labels_positive',
        'agreement', 'id'],
       dtype='object'),
 Index(['doi', 'text_id', 'text', 'labels_negative', 'labels_positive',
        'agreement', 'id'],
       dtype='object'))
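
The comparison of the two column lists can also be done programmatically rather than by eye. The sketch below is a minimal illustration that uses plain lists copied from the output above; the variable names are hypothetical.

```python
# Column names copied from the output above
train_columns = ['doi', 'text_id', 'text', 'sdg', 'labels_negative',
                 'labels_positive', 'agreement', 'id']
test_columns = ['doi', 'text_id', 'text', 'labels_negative',
                'labels_positive', 'agreement', 'id']

# Columns present in train but absent from test are the candidate targets
test_column_set = set(test_columns)
target_candidates = [col for col in train_columns if col not in test_column_set]
print(target_candidates)  # ['sdg']
```

On the actual DataFrames, the same check could be written as `train.columns.difference(test.columns)`.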

<a id=four1></a>

#### 3.3 Dataset summary

It is important to identify columns with null entries, as null values can affect the performance of our model. The pandas `DataFrame.info()` method gives a concise summary of the DataFrame, and `isna().sum()` counts the null values in each column. As the output below shows, this data set contains no null values.

In [7]:
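def Summary(df):
    """Print a concise summary of df and return (None, null count per column)."""
    i = df.info()  # info() prints the summary itself and returns None
    print ("NUL Values")
    n = df.isna().sum()  # number of null values in each column
    return i,n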

In [8]:
Summary(train)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25944 entries, 0 to 25943
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   doi              25944 non-null  object 
 1   text_id          25944 non-null  object 
 2   text             25944 non-null  object 
 3   sdg              25944 non-null  int64  
 4   labels_negative  25944 non-null  int64  
 5   labels_positive  25944 non-null  int64  
 6   agreement        25944 non-null  float64
 7   id               25944 non-null  int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 1.6+ MB
NUL Values


(None,
 doi                0
 text_id            0
 text               0
 sdg                0
 labels_negative    0
 labels_positive    0
 agreement          0
 id                 0
 dtype: int64)

<a id="four"></a>
## 4. Data Preprocessing (Cleaning)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

| ⚡ Description: Data Cleaning ⚡ |
| :--------------------------- |
| In this phase, we convert the data into a clean, usable format and filter out the most relevant information. |

<a id=four1></a>
### 4.1 Preprocessing the Dataset

#### Because our datasets contain a non-numerical text column, several preprocessing steps must be performed, including:


* Letter casing: converting all letters to either upper case or lower case.

* Tokenization: splitting a text into tokens, which here are the individual words separated by spaces.

* Noise removal: removing unwanted characters such as HTML tags, punctuation marks, special characters, and extra white space.

* Stopword removal: dropping common words that contribute little to the machine learning model. The nltk library provides a list of stopwords, which can also be tailored to a particular domain.

* Lemmatization: reducing a word's several forms to a single form, such as converting "builds", "building", or "built" to the lemma "build".

---

In [9]:
# Remove links (http://, https:// and www. URLs), then trim surrounding whitespace
train['text'] = train['text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
train['text'] = train['text'].str.strip()

In [10]:
def preprocess(text):
    """Clean a single text string and return a list of lemmatized tokens.

    Removes @mentions, hashtags, 'RT' markers, punctuation, digits, non-ASCII
    characters and stopwords, expands contractions, and lemmatizes each token."""

    tokenizer = TreebankWordTokenizer() 
    lemmatizer = WordNetLemmatizer()
    stopwords_list = stopwords.words('english')
    point_noise = string.punctuation + '0123456789'
    
    cleanText = re.sub(r'@[a-zA-Z0-9\_\w]+', '', text) # Remove @mentions
    cleanText = re.sub(r'#[a-zA-Z0-9]+', '', cleanText) # Remove hashtags
    cleanText = re.sub(r'RT', '', cleanText) # Remove 'RT' from text
    # Expand specific contractions
    cleanText = re.sub(r"won\'t", "will not", cleanText)
    cleanText = re.sub(r"can\'t", "can not", cleanText)
    cleanText = re.sub(r"also", "", cleanText) # Remove the substring "also"
    # Expand general contractions
    cleanText = re.sub(r"n\'t", " not", cleanText)
    cleanText = re.sub(r"\'re", " are", cleanText)
    cleanText = re.sub(r"\'s", " is", cleanText)
    cleanText = re.sub(r"\'d", " would", cleanText)
    cleanText = re.sub(r"\'ll", " will", cleanText)
    cleanText = re.sub(r"\'t", " not", cleanText)
    cleanText = re.sub(r"\'ve", " have", cleanText)
    cleanText = re.sub(r"\'m", " am", cleanText)
    cleanText = ''.join([char for char in cleanText if char not in point_noise]) # Remove punctuation and digits
    cleanText = cleanText.lower() # Lowercase
    cleanText = "".join(char for char in cleanText if ord(char)<128) # Remove non-ASCII characters
    cleanText = tokenizer.tokenize(cleanText) # Split the text into tokens
    cleanText = [lemmatizer.lemmatize(word) for word in cleanText if word not in stopwords_list] # Remove stopwords, then lemmatize
    cleanText = [word for word in cleanText if len(word) >= 2] # Drop single-character tokens
    return cleanText
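
One side effect of the noise-removal step is worth illustrating: because punctuation characters are deleted rather than replaced, words separated only by punctuation are fused, which is a likely origin of tokens such as `edswashington` in the cleaned data. The sketch below contrasts deletion with one possible alternative, mapping each noise character to a space. The `sample` string is invented for illustration and is not taken from the dataset.

```python
import string

point_noise = string.punctuation + '0123456789'
sample = "Robalino, eds.Washington, DC: World Bank"  # invented example text

# Approach used in preprocess(): delete every punctuation character and digit
deleted = ''.join(ch for ch in sample if ch not in point_noise).lower()

# Alternative: map each of those characters to a space so neighbours stay apart
to_space = str.maketrans(point_noise, ' ' * len(point_noise))
spaced = sample.translate(to_space).lower()

print(deleted.split())  # ['robalino', 'edswashington', 'dc', 'world', 'bank']
print(spaced.split())   # ['robalino', 'eds', 'washington', 'dc', 'world', 'bank']
```

Whether the alternative is preferable depends on the corpus; it would also split hyphenated terms and decimals into separate tokens.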

In [11]:
#applying the preprocess function
train["text"]=train["text"].apply(preprocess)


In [12]:
train.head(1)

Unnamed: 0,doi,text_id,text,sdg,labels_negative,labels_positive,agreement,id
0,10.18356/5950d914-en,bf7763beb5ad7a16764d1b7fa87ab018,"[indicator, target, data, available, globally, energy, efficiency, use, cleaner, fuel, technology, reduced, carbon, dioxide, emission, per, unit, value, added, per, cent, although, expenditure, research, development, continues, grow, globally, poorest, country, especially, africa, spend, small, proportion, gdp, expenditure, global, investment, research, development, stood, trillion, purchasing, power, parity, billion]",9,4,5,0.111111,1


<a id="five"></a>
## 5. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| We'll use a range of strategies in this part to maximize specific insights into our dataset, uncover underlying structure, extract relevant variables, find outliers and anomalies, test assumptions, and establish the optimum estimation parameters. In other words, we want to go deeper into our dataset in order to learn more about its behavior! |

---



<a id=four3></a>

#### 5.1 Most Frequent Words

In [13]:
from collections import Counter
cnt = Counter()
for message in train['text'].values:
    for word in message:
        cnt[word] += 1
        
cnt.most_common(5)

[('country', 11043),
 ('policy', 6583),
 ('woman', 5843),
 ('development', 5827),
 ('water', 5423)]
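
The nested loop above can be expressed more compactly by flattening the token lists and passing them straight to `Counter`. This is a minimal sketch on a small, hypothetical `token_lists` value standing in for `train['text']` after preprocessing.

```python
from collections import Counter
from itertools import chain

# Hypothetical token lists standing in for train['text'] after preprocessing
token_lists = [
    ['country', 'policy', 'water'],
    ['country', 'woman', 'development'],
    ['water', 'country'],
]

# Flatten the lists lazily and count every token in a single pass
word_counts = Counter(chain.from_iterable(token_lists))
print(word_counts.most_common(2))  # [('country', 3), ('water', 2)]
```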

* Distribution of texts per SDG
---

In [14]:
print("See distribution of messages per sdg : ")
count = train.groupby("sdg").count()["text"].reset_index().sort_values(by="text", ascending=False)
count.style.background_gradient(cmap="Purples")

See distribution of messages per sdg : 


Unnamed: 0,sdg,text
4,5,3438
3,4,2999
6,7,2473
5,6,2247
0,1,2190
2,3,2132
1,2,1963
10,11,1798
12,13,1695
7,8,1218
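
To put a number on how uneven these classes are, the sketch below computes the ratio between the largest and smallest class. It uses only the ten rows shown in the output above, so it is an illustration rather than a statement about the full dataset.

```python
# Text counts per SDG, copied from the rows shown in the output above
counts = {5: 3438, 4: 2999, 7: 2473, 6: 2247, 1: 2190,
          3: 2132, 2: 1963, 11: 1798, 13: 1695, 8: 1218}

largest = max(counts, key=counts.get)   # SDG with the most texts
smallest = min(counts, key=counts.get)  # SDG with the fewest texts
ratio = counts[largest] / counts[smallest]
print(largest, smallest, round(ratio, 2))  # 5 8 2.82
```

Among the rows listed, SDG 5 has almost three times as many texts as SDG 8, which is worth keeping in mind when reading per-class metrics later.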


In [20]:
# Map each SDG code to its name for easier reading
sdgLables = {1: "No poverty", 2: "Zero Hunger", 3: "Good Health and well-being", 4: "Quality Education", 5: "Gender equality", 6: "Clean water and sanitation", 7: "Affordable and clean energy", 8: "Decent work and economic growth", 9: "Industry, Innovation and Infrustructure",
             10: "Reduced Inequality", 11: "Sustainable cites and communities", 12: "Responsible consumption and production", 13: "Climate Action", 14: "life below water", 15: "Life on land", 16: "Peace , Justice and strong institutions", 17: "Partnership for the goals"}
train['SDG_Labels'] = train['sdg'].map(sdgLables)

# Confirm the dataset
train.head(3)


Unnamed: 0,doi,text_id,text,sdg,labels_negative,labels_positive,agreement,id,SDG_Labels
0,10.18356/5950d914-en,bf7763beb5ad7a16764d1b7fa87ab018,"[indicator, target, data, available, globally, energy, efficiency, use, cleaner, fuel, technology, reduced, carbon, dioxide, emission, per, unit, value, added, per, cent, although, expenditure, research, development, continues, grow, globally, poorest, country, especially, africa, spend, small, proportion, gdp, expenditure, global, investment, research, development, stood, trillion, purchasing, power, parity, billion]",9,4,5,0.111111,1,"Industry, Innovation and Infrustructure"
1,10.18356/5950d914-en,b6415a528064b85fdde4b4c61239ed3a,"[manufacturing, value, added, percentage, gdp, stood, per, cent, africa, excluding, north, africa, per, cent, north, africa, comparison, figure, per, cent, latin, america, caribbean, per, cent, least, developed, country, per, cent, asia, pacific, per, cent, globally, neither, north, africa, rest, africa, made, significant, progress, first, half, current, decade, manufacturing, value, added, percentage, gdp, increased, slightly, per, cent, period, africa, excluding, north, africa, per, cent, north, africa]",9,0,3,1.0,2,"Industry, Innovation and Infrustructure"
2,10.18356/31959a6d-en,29127def7e81b999b87c8e887a4fe882,"[share, share, question, volume, gender, politics, financial, stability, holzmann, palmer, robalino, edswashington, dc, world, bank, political, economy, pension, reform, europe, handbook, ageing, social, science, binstock, george, edssan, diego, ca, academic, press]",5,2,7,0.555556,3,Gender equality
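
When mapping codes to names with `Series.map`, any code missing from the dictionary becomes `NaN`, so it can help to check coverage first. The sketch below shows such a check on small, hypothetical values; `sdg_names` and `codes_in_data` are illustrative stand-ins, not the notebook's variables.

```python
# Hypothetical subset of the code-to-name mapping and of the codes found in the data
sdg_names = {1: "No poverty", 4: "Quality Education", 5: "Gender equality"}
codes_in_data = [5, 4, 9, 5, 1]

# Codes without a name would become NaN after Series.map, so list them first
unmapped = sorted(set(codes_in_data) - set(sdg_names))
print(unmapped)  # [9]
```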


In [21]:
# For a clearer view, show the distribution by SDG label
print("See distribution of messages per sdg : ")
count = train.groupby("SDG_Labels").count()["text"].reset_index().sort_values(by="text", ascending=False)
count.style.background_gradient(cmap="Purples")

See distribution of messages per sdg : 


Unnamed: 0,SDG_Labels,text
4,Gender equality,3438
9,Quality Education,2999
0,Affordable and clean energy,2473
1,Clean water and sanitation,2247
8,No poverty,2190
5,Good Health and well-being,2132
13,Zero Hunger,1963
12,Sustainable cites and communities,1798
2,Climate Action,1695
3,Decent work and economic growth,1218


In [None]:
# Treemap of the number of texts per SDG label
fig = px.treemap(count, path=['SDG_Labels'], title='Treemap of text counts per SDG label',
                 values='text', color="text")
fig.show()