# <center><u> **Assignment 2: Classification Models** </u></center>
## <center> Team: Rivonia Rodeo </center>
## <center> Topic: South African SDG Hub </center>
***

## <u>Introduction</u> ##

**Copied from the Veridical Data Science article (delete this when done)**

We propose the following steps in a notebook:
1. Domain problem formulation (narrative).  
Clearly state the real-world question and describe prior work related to
this question. Indicate how this question can be answered
in the context of a model or analysis.

2. Data collection and storage (narrative). Describe how
the data were generated, including experimental design
principles, and reasons why data is relevant to answer the
domain question. Describe where data is stored and how
it can be accessed by others.

3. Data cleaning and preprocessing (narrative, code, visualization).
Describe steps taken to convert raw data into
data used for analysis, and why these preprocessing steps
are justified. Ask whether more than one preprocessing
methods should be used and examine their impacts on
the final data results.

4. Exploratory data analysis (narrative, code, visualization).
Describe any preliminary analyses that influenced modeling
decisions or conclusions along with code and visualizations
to support these decisions.

5. Modeling and Post-hoc analysis (narrative, code, visualization).
Carry out PCS inference in the context of the
domain question. Specify appropriate model and data
perturbations. If necessary, specify null hypotheses and
associated perturbations.

6. Interpretation of results (narrative and visualization).
Translate the data results to draw conclusions and/or
make recommendations in the context of domain problem.


## 1. Domain problem formulation  
**Goal 1: To develop and train a classifier on the articles in the static database based on the 17 SDGs to improve on the performance of the existing model.**

Prior to this analysis, an exploratory data analysis was performed to assess the data and to identify which models would be best suited to this goal. A classification model was identified as the most suitable approach, with natural language processing techniques used to pre-process the abstracts.

## 2. Data collection and storage  
The data consists of article abstracts, the source of each article and its classification into one of the 17 Sustainable Development Goals (SDGs). Some of the articles are further classified into the sustainable development targets, a further set of achievables within each of the SDGs.  

The data was collected by a third party from five different internet sources. A classification model is needed to improve how each article is classified into these 17 SDGs. The data is stored by the University of Pretoria but can be easily accessed from online sources or through the South African SDG Hub website, which categorises and stores each article used in this study.
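
As a minimal sketch of the label layout described above, the snippet below reads the SDG labels of one record from one-hot columns. The column names `SDG1` to `SDG17` match those selected later in the notebook; the example record is hypothetical, and allowing more than one label per article is an assumption rather than something the data description confirms.

```python
# Column names follow the SDG1..SDG17 columns used later in this notebook.
SDG_COLUMNS = [f"SDG{i}" for i in range(1, 18)]

def labels_from_row(row):
    """Return the names of the SDG columns that are set to 1 in one record."""
    return [col for col in SDG_COLUMNS if row.get(col, 0) == 1]

# Hypothetical record, for illustration only (not taken from the real data).
example_row = {
    "abstract": "Hypothetical abstract about water and climate.",
    "source": "hypothetical-source",
    "SDG6": 1,
    "SDG13": 1,
}

print(labels_from_row(example_row))
```

With a pandas data frame the same idea could be applied row by row, for example via `row.to_dict()`.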

## 3. Data cleaning and preprocessing
Several data cleaning steps are needed before the abstracts, which are free-text data, can be modelled. Natural language processing techniques are used to clean and prepare the data for the classification analysis.

### 3.1 Libraries 

In [None]:
%matplotlib inline

import os
import re
import string

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
from scipy import stats
from collections import OrderedDict
from sklearn.preprocessing import MultiLabelBinarizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

### 3.2 Import data 

In [None]:
# NOTE: this is a local path; adjust it to wherever the TSV file is stored
data = pd.read_csv(r'C:\Users\TP659BF\Downloads\Masters\MIT 808\Data\OneHot_Combined_train.tsv', sep='\t', encoding='utf-8')
data.head()

### 3.3 Data cleaning 

In [None]:
df = data.dropna() 
df = df.drop(['Unnamed: 0'],axis=1)

In [None]:
# Clean the original data by removing unwanted characters and strings.

# work on a copy so the raw `data` frame is left untouched
df_full = data.copy()

# collapse whitespace, then drop all rows with no text in 'abstract' column
df_full = df_full.replace(r'\s+', ' ', regex=True)
df_full['abstract'] = df_full['abstract'].replace(' ', np.nan)
df_full = df_full[df_full['abstract'].notnull()]

# remove duplicates
df_full = df_full.drop_duplicates(subset='abstract')

# reindex dataframe
df_full = df_full.reset_index(drop=True)

# remove HTML
def remove_html(text):
    soup = BeautifulSoup(text, 'lxml')  # requires the lxml parser (pip install lxml)
    return soup.get_text()
df_full['preproc_text'] = df_full['abstract'].apply(remove_html)

# remove URLs (applied to 'preproc_text' so the HTML removal above is not overwritten)
def remove_urls(text):
    return re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)
df_full['preproc_text'] = df_full['preproc_text'].apply(remove_urls)

# remove emails
def remove_email(text):
    return re.sub(r'\S*@\S*\s?', '', text)
df_full['preproc_text'] = df_full['preproc_text'].apply(remove_email)

# remove punctuation
def remove_punctuation(text):
    return ''.join(c for c in text if c not in string.punctuation)
df_full['preproc_text'] = df_full['preproc_text'].apply(remove_punctuation)

# remove digits separately (may want to keep for identification of SDGs?)
def remove_digits(text):
    return ''.join(c for c in text if not c.isdigit())
df_full['preproc_text'] = df_full['preproc_text'].apply(remove_digits)

# remove single letter words
def remove_singletons(text):
    return re.sub(r'\b[a-zA-Z]\b', '', text)
df_full['preproc_text'] = df_full['preproc_text'].apply(remove_singletons)


# tokenize text and make text lowercase (change regex to include other characters)
tokenizer = RegexpTokenizer(r'\w+')
df_full['token_text'] = df_full['preproc_text'].apply(lambda x: tokenizer.tokenize(x.lower()))


In [None]:
# remove stopwords using the English list from NLTK
# (needs the corpus to be available locally: nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # `tokens` is the list of lowercase tokens produced by the tokenizer above
    return [w for w in tokens if w not in stop_words]
df_full['nostop_text'] = df_full['token_text'].apply(remove_stopwords)

In [None]:
# keep only the columns needed for modelling; .copy() gives an independent
# frame so later column assignments do not raise SettingWithCopyWarning
sdg_cols = [f'SDG{i}' for i in range(1, 18)]
df_clean = df_full[['abstract', 'preproc_text', 'source'] + sdg_cols].copy()
df_clean.head()
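
The order of the cleaning steps matters: HTML, URLs and e-mail addresses have to be removed before punctuation is stripped, otherwise they can no longer be recognised. The sketch below chains the steps from this section into a single function to make that order explicit. It is an illustration only: it swaps BeautifulSoup for a crude regular expression, uses a simplified URL pattern, and relies on a tiny hypothetical stop-word list (`TOY_STOP_WORDS`) instead of the NLTK list.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself uses NLTK's English stop words.
TOY_STOP_WORDS = {"the", "of", "and", "to", "in", "is", "for", "at", "or", "a"}

def clean_abstract(text, stop_words=TOY_STOP_WORDS):
    """Return lowercase tokens after applying the cleaning steps in order."""
    text = re.sub(r"<[^>]+>", " ", text)        # crude HTML tag removal (assumption)
    text = re.sub(r"https?://\S+", " ", text)   # URLs (simplified pattern)
    text = re.sub(r"\S*@\S*", " ", text)        # e-mail addresses
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d", "", text)              # digits
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)   # single-letter words
    tokens = re.findall(r"\w+", text.lower())   # tokenize and lowercase
    return [t for t in tokens if t not in stop_words]

sample = ("<p>Access to clean water in 2030 is a goal: "
          "see https://example.org or mail info@example.org</p>")
print(clean_abstract(sample))
# ['access', 'clean', 'water', 'goal', 'see', 'mail']
```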

## 4. Exploratory data analysis
Most of the EDA for this study was done in a previous assignment, where a classification model was determined to be the most applicable approach.

In [None]:
from nltk import FreqDist

# list of all words across the stop-word-filtered token lists
# (df_clean['preproc_text'] holds plain strings, so iterating over it
# would yield single characters rather than words)
all_words = [word for tokens in df_full['nostop_text'] for word in tokens]

# frequency distribution
fdist = FreqDist(all_words)

# set of the 10000 most common words
top_words, _ = zip(*fdist.most_common(10000))
top_words = set(top_words)

# keep only the top 10000 most common words
def keep_top_words(tokens):
    return [word for word in tokens if word in top_words]

# df_clean shares its row index with df_full, so the result aligns row by row
df_clean = df_clean.assign(preproc_text=df_full['nostop_text'].apply(keep_top_words))
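
To show what the vocabulary cut-off does, here is a small self-contained sketch on toy data. It uses `collections.Counter` from the standard library in place of NLTK's `FreqDist`, and the documents and the cut-off of 2 words are made up for illustration.

```python
from collections import Counter

def build_vocabulary(token_lists, max_words):
    """Return the set of the `max_words` most frequent tokens."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word for word, _ in counts.most_common(max_words)}

def keep_top_words(tokens, vocabulary):
    """Drop every token that is not part of the vocabulary."""
    return [word for word in tokens if word in vocabulary]

# Toy token lists standing in for the tokenized abstracts.
docs = [
    ["water", "sanitation", "access"],
    ["water", "energy", "access"],
    ["water", "poverty"],
]

vocab = build_vocabulary(docs, max_words=2)
filtered = [keep_top_words(doc, vocab) for doc in docs]
print(filtered)
# [['water', 'access'], ['water', 'access'], ['water']]
```

Rare words such as "sanitation" disappear, which shrinks the feature space but can also remove terms that are specific to a single SDG.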

## 5. Modeling and Post-hoc analysis

### 5.1 Test and training set split
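
A minimal sketch of a random train/test split on row positions, written with the standard library only. The 80/20 ratio, the seed and the function name `split_indices` are assumptions for illustration; in the notebook itself, scikit-learn's `train_test_split` would be the more usual choice.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row positions and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
# 8 2
```

The resulting positions could then be used to select rows, for example `df_clean.iloc[train_idx]` and `df_clean.iloc[test_idx]`.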

### 5.2 Split the training set into a further validation and training set
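
One possible way to carve a validation set out of the training rows is sketched below. The validation fraction of 25%, the seed and the helper name `carve_validation` are assumptions, and the index list is a stand-in for the real training rows.

```python
import random

def carve_validation(train_indices, val_fraction=0.25, seed=0):
    """Split existing training indices into (train, validation) lists."""
    shuffled = list(train_indices)          # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

train_idx = list(range(8))                  # stand-in for the training rows from 5.1
fit_idx, val_idx = carve_validation(train_idx)
print(len(fit_idx), len(val_idx))
# 6 2
```

The validation rows are used for model selection and tuning, so that the test set stays untouched until the final evaluation.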

## 6. Interpretation of results
In the results, we can determine the effectiveness of the classification model by assessing its accuracy, precision (positive predictive value), sensitivity and specificity.
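
As a sketch of how such metrics could be computed for a single SDG label (one label versus the rest), the snippet below derives accuracy, precision, sensitivity and specificity from the confusion-matrix counts. The labels and predictions are made up for illustration and are not results from this study.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic metrics for one binary label from 0/1 lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),   # also called recall
        "specificity": ratio(tn, tn + fp),
    }

# Hypothetical labels and predictions for one SDG column.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.75, precision and sensitivity are both 2/3, and specificity is 0.8. With imbalanced SDG classes, accuracy alone can look good even when sensitivity is poor, which is why the other metrics are reported alongside it.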

## 7. References

- *Structure of the Notebook*  
Veridical Data Science Article   
<br>
- *EDA*  
https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190  
https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15  
https://r4ds.had.co.nz/exploratory-data-analysis.html  
https://medium.com/text-classification-algorithms/text-classification-algorithms-a-survey-a215b7ab7e2d    
<br>
- *Imbalanced data*  
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18  
<br>
- *Modelling*   
https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validation  
<br>
- *Semi-supervised learning*  
https://www.analyticsvidhya.com/blog/2017/09/pseudo-labelling-semi-supervised-learning-technique/