# PART 1 : Blog Classification problem

## 1. Read and Analyse Dataset.

### A. Clearly write outcome of data analysis

In [1]:
from zipfile import ZipFile

import pandas as pd

# Read the CSV while the archive is still open
with ZipFile('blogs.zip', 'r') as zipdata:
    with zipdata.open('blogtext.csv') as data_csv:
        df = pd.read_csv(data_csv)

print(df.columns)

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')


In [2]:
df.head(5)

|   | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


In [3]:
del data_csv

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      681284 non-null  int64 
 1   gender  681284 non-null  object
 2   age     681284 non-null  int64 
 3   topic   681284 non-null  object
 4   sign    681284 non-null  object
 5   date    681284 non-null  object
 6   text    681284 non-null  object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


In [5]:
## Check the nulls
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [6]:
## topic is the target variable.
## Let's check the distribution of topic

print(df['topic'].value_counts())

indUnk                     251015
Student                    153903
Technology                  42055
Arts                        32449
Education                   29633
Communications-Media        20140
Internet                    16006
Non-Profit                  14700
Engineering                 11653
Law                          9040
Publishing                   7753
Science                      7269
Government                   6907
Consulting                   5862
Religion                     5235
Fashion                      4851
Marketing                    4769
Advertising                  4676
BusinessServices             4500
Banking                      4049
Chemicals                    3928
Telecommunications           3891
Accounting                   3832
Military                     3128
Museums-Libraries            3096
Sports-Recreation            3038
HumanResources               3010
RealEstate                   2870
Transportation               2326
Manufacturing 

The topic column has 39 distinct values, distributed as shown above.
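
Because the classes are heavily imbalanced, it helps to know how well a classifier would do by always predicting the most frequent topic. The sketch below computes that majority-class baseline from the counts printed above. Only the five largest counts are copied in, and the total row count (681,284) is the one reported by `df.info()`.

```python
# Majority-class baseline computed from the value counts shown above.
# Only the five largest classes are copied here; the total comes from df.info().
top_counts = {
    "indUnk": 251015,
    "Student": 153903,
    "Technology": 42055,
    "Arts": 32449,
    "Education": 29633,
}
n_rows = 681284

# The most frequent topic and the accuracy of always predicting it
majority_topic = max(top_counts, key=top_counts.get)
baseline_accuracy = top_counts[majority_topic] / n_rows

print(f"Most frequent topic: {majority_topic}")
print(f"Share of all posts:  {baseline_accuracy:.4f}")
```

Accuracy scores are easier to interpret when read against this baseline (roughly 0.37) rather than against zero.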

In [7]:
print ('size of the dataframe:',df.size)
print ('shape of the dataframe:',df.shape)

size of the dataframe: 4768988
shape of the dataframe: (681284, 7)


### B. Clean the Structured Data

### 2. Preprocess unstructured data to make it consumable for model training.

A. Eliminate All special Characters and Numbers 

B. Lowercase all textual data 

C. Remove all Stopwords 

D. Remove all extra white spaces


In [8]:
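# Keep letters only: replace every run of other characters with a space
import re
from nltk.corpus import stopwords

df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

# Convert text to lowercase
df.text = df.text.apply(lambda x: x.lower())

# Strip leading and trailing spaces
df.text = df.text.apply(lambda x: x.strip())

# Remove stopwords (requires a one-time nltk.download('stopwords')).
# A separate name keeps the imported stopwords module from being overwritten,
# so the cell can be re-run safely.
stop_words = set(stopwords.words('english'))
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))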

In [9]:
df.text[6]

'somehow coca cola way summing things well early flagship jingle like buy world coke tune like teach world sing pretty much summed post woodstock era well add much sales catchy tune korea coke theme urllink stop thinking feel pretty much sums lot korea koreans look relaxed couple stopped thinking started feeling course high regard education math logic deep think many koreans really like work emotion anything else westerners seem sublimate moreso least display different way maybe scratch westerners koreans probably pretty similar context different anyways think losing korea repeat stop thinking feel stop thinking feel stop thinking feel everything alright'
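
The four preprocessing steps can also be folded into a single helper, which is easier to test on a short string before running it over the whole column. This is a minimal sketch: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stop-word set stands in for NLTK's English list so the example runs without any downloads.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOPWORDS = {"the", "a", "is", "to", "and", "of", "i", "it"}

NON_LETTERS = re.compile(r"[^A-Za-z]+")


def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Keep letters only, lowercase, drop stop words, collapse whitespace."""
    letters_only = NON_LETTERS.sub(" ", text)
    lowered = letters_only.lower()
    # split() with no arguments also discards leading, trailing and repeated spaces
    tokens = [tok for tok in lowered.split() if tok not in stop_words]
    return " ".join(tokens)


sample = "  Testing!!! It is 2004, and the TOOLBAR works :)  "
print(clean_text(sample))  # testing toolbar works
```

With a helper like this, the column could be cleaned in one pass with `df.text.apply(clean_text)`, passing the full stop-word set as the second argument.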

### Build a base Classification model

A. Create dependent and independent variables 

[ Hint: Treat ‘topic’ as a Target variable.]

B. Split data into train and test. 

C. Vectorize data using any one vectorizer. 

D. Build a base model for Supervised Learning - Classification. 

E. Clearly print Performance Metrics. 

In [10]:
## Split test and train data.
## The target variable is topic and training variable is text.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.text.values, df.topic.values, test_size=0.20, random_state=42)


print('Training utterances: {}'.format(X_train.shape[0]))
print('Validation utterances: {}'.format(X_test.shape[0]))

Training utterances: 545027
Validation utterances: 136257


In [11]:
## Let's use TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)

TfidfVectorizer()

In [12]:
X_train_tf = vectorizer.transform(X_train)
X_test_tf = vectorizer.transform(X_test)
X_train_tf, X_test_tf

(<545027x557241 sparse matrix of type '<class 'numpy.float64'>'
 	with 41239292 stored elements in Compressed Sparse Row format>,
 <136257x557241 sparse matrix of type '<class 'numpy.float64'>'
 	with 10212902 stored elements in Compressed Sparse Row format>)

In [13]:
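## Let's print the first few feature names
## (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
list(vectorizer.get_feature_names_out()[:5])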

['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa']

In [14]:
## Let's use Naive Bayes classifier for the prediction

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train_tf, y_train)
predicted = clf.predict(X_test_tf)


In [15]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

print("Accuracy Score:",accuracy_score(y_test, predicted))
print('F1 score: ', f1_score(y_test, predicted, average='micro'))
print('Precision score: ', precision_score(y_test, predicted,average='micro'))
print('Recall score: ', recall_score(y_test, predicted, average='micro'))

Accuracy Score: 0.3911872417563868
F1 score:  0.39118724175638675
Precision score:  0.3911872417563868
Recall score:  0.3911872417563868
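
The four scores are identical because, for single-label multiclass data, micro-averaged precision, recall and F1 all reduce to accuracy. The hand-rolled sketch below shows this on toy labels invented for illustration, and contrasts it with the macro average, which weights every class equally. The helper name `f1` is hypothetical.

```python
from collections import Counter

# Toy labels, invented for illustration (not taken from the blog dataset).
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "a", "a", "c"]

labels = sorted(set(y_true) | set(y_pred))
tp, fp, fn = Counter(), Counter(), Counter()
for t, p in zip(y_true, y_pred):
    if t == p:
        tp[t] += 1
    else:
        fp[p] += 1  # class p was predicted, but wrongly
        fn[t] += 1  # true class t was missed


def f1(tp_, fp_, fn_):
    """F1 from raw counts; 0.0 when the class never appears."""
    denom = 2 * tp_ + fp_ + fn_
    return 2 * tp_ / denom if denom else 0.0


accuracy = sum(tp.values()) / len(y_true)
# Micro: pool the counts over all classes, then compute one score.
micro_f1 = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
# Macro: compute one score per class, then take the unweighted mean.
macro_f1 = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)

print(f"accuracy = {accuracy:.3f}")
print(f"micro F1 = {micro_f1:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

In scikit-learn the same contrast is available by passing `average='macro'` (or `'weighted'`) instead of `average='micro'` to `f1_score`, `precision_score` and `recall_score`.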


### Improve Performance of model. 

A. Experiment with other vectorisers.

B. Build classifier Models using other algorithms than base model.

C. Tune Parameters/Hyperparameters of the model/s. 

D. Clearly print Performance Metrics. 

Hint: Accuracy, Precision, Recall, ROC-AUC

In [None]:
## Let's use a different vectoriser
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_cnt = CountVectorizer(binary=True, ngram_range=(1, 4))
X_train_cnt = vectorizer_cnt.fit_transform(X_train)
X_test_cnt = vectorizer_cnt.transform(X_test)

In [None]:
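## print the first 5 feature names
## (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
list(vectorizer_cnt.get_feature_names_out()[:5])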

In [None]:
## Let's use Adaboost for the prediction purpose
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators=40, random_state=1)
abcl = abcl.fit(X_train_cnt, y_train)

In [None]:
predicted_cnt = abcl.predict(X_test_cnt)

In [None]:
print("Accuracy Score with count vectorizer/Adaboost:",accuracy_score(y_test, predicted_cnt))
print('F1 score  with count vectorizer/Adaboost ', f1_score(y_test, predicted_cnt, average='micro'))
print('Precision score  with count vectorizer/Adaboost:', precision_score(y_test, predicted_cnt,average='micro'))
print('Recall score  with count vectorizer/Adaboost:', recall_score(y_test, predicted_cnt, average='micro'))

In [None]:
## Let's try to tune the TF-IDF vectorizer
## Set max_df to 150 and min_df to 20 (integers are absolute document counts),
## i.e. keep only terms that occur in at least 20 and at most 150 documents

vectorizer_tuned = TfidfVectorizer(max_df=150, min_df=20)
vectorizer_tuned.fit(X_train)

X_train_tuned = vectorizer_tuned.transform(X_train)
X_test_tuned = vectorizer_tuned.transform(X_test)
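
In `TfidfVectorizer`, an integer `max_df` or `min_df` is an absolute number of documents, while a float between 0.0 and 1.0 is a fraction of the training documents. The helper below is a hypothetical illustration (`doc_count_bounds` is not part of scikit-learn) that converts both settings to document counts for the 545,027 training posts, so it is easy to check that the lower bound does not exceed the upper bound.

```python
def doc_count_bounds(n_docs, min_df, max_df):
    """Convert min_df/max_df to absolute document counts.

    Follows the documented convention: int = document count,
    float = proportion of documents. Hypothetical helper for illustration.
    """
    def to_count(value):
        return value if isinstance(value, int) else value * n_docs

    low, high = to_count(min_df), to_count(max_df)
    return low, high, low <= high


n_train = 545027  # training rows reported by the split above

for min_df, max_df in [(20, 150), (0.01, 150), (0.01, 0.5)]:
    low, high, ok = doc_count_bounds(n_train, min_df, max_df)
    print(f"min_df={min_df!r}, max_df={max_df!r} -> "
          f"keep terms in {low:,.0f} to {high:,.0f} docs, consistent={ok}")
```

Mixing an integer with a float can produce an empty range: 1% of 545,027 documents is about 5,450, which is far above an upper bound of 150 documents. scikit-learn raises an error for such a combination.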

In [None]:
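## Let's print the first few feature names
## (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
print(list(vectorizer_tuned.get_feature_names_out()[:5]))

## Let's use a support vector classifier (SVC) for the prediction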

from sklearn.svm import SVC

clf = SVC(gamma=0.4, C=3)

clf.fit(X_train_tuned , y_train)


In [None]:
## Let's predict and print the scores
predicted_tuned = clf.predict(X_test_tuned)

print("Accuracy Score with TFIDF tuned vectorizer/svm:",accuracy_score(y_test, predicted_tuned))
print('F1 score  with TFIDF tuned vectorizer/svm ', f1_score(y_test, predicted_tuned, average='micro'))
print('Precision score  with TFIDF tuned vectorizer/svm:', precision_score(y_test, predicted_tuned,average='micro'))
print('Recall score  with TFIDF tuned vectorizer/svm:', recall_score(y_test, predicted_tuned, average='micro'))


#### The accuracy of the model increases significantly after tuning the TF-IDF vectorizer and switching to a different classifier (SVM)

### Share insights on relative performance comparison

#### A. Which vectorizer performed better? Probable reason?

Ans: TF-IDF clearly outperformed the count vectorizer. The probable reason:

TF-IDF weights each term by how often it occurs in a document and down-weights terms that occur in many documents, so it captures how informative a word is rather than only how frequent it is. A count vectorizer records occurrences alone.
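
A toy calculation makes the difference concrete. The sketch below uses the smoothed inverse document frequency that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, on a three-document corpus invented for illustration. It leaves out the L2 normalisation that `TfidfVectorizer` applies afterwards.

```python
import math
from collections import Counter

# Tiny invented corpus; not taken from the blog dataset.
docs = [
    "coke tune korea coke",
    "korea school exam",
    "korea coke theme",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed idf, as in scikit-learn's default (smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


counts = Counter(tokenized[0])  # what a count vectorizer would store
tfidf = {t: c * idf(t) for t, c in counts.items()}  # before L2 normalisation

for term in sorted(counts):
    print(f"{term:6s} count={counts[term]}  idf={idf(term):.3f}  tf-idf={tfidf[term]:.3f}")
```

Here "korea" occurs in every document and receives the minimum idf of 1, while "tune", which occurs in only one document, gets the highest weight per occurrence. A count vectorizer would treat a single occurrence of either word identically.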

#### B. Which model outperformed? Probable reason?

Ans: The tuned TF-IDF vectorizer combined with an SVM performed best, with an accuracy score of 0.7. TF-IDF is the stronger vectorizer because it weights words by their importance.
An SVM looks for a maximum-margin boundary between classes and copes well with high-dimensional, sparse features, which probably makes it a better classifier for NLP text classification problems.

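#### C. Which parameter/hyperparameter significantly helped to improve performance? Probable reason?

Ans: The value of gamma in the SVC classifier and the max_df and min_df settings in the TF-IDF vectorizer improved the performance significantly. gamma sets how far the influence of a single training example reaches, while max_df and min_df prune very common and very rare terms from the vocabulary.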
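
For context, `SVC` uses the RBF kernel by default, K(x, y) = exp(-gamma * ||x - y||^2), so gamma controls how quickly similarity decays with distance. The sketch below evaluates that formula for two vectors invented for illustration at a few values of gamma, including the 0.4 used above. The function name `rbf_kernel` is local to this example.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two dense vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two invented 3-dimensional vectors, for illustration only.
x = [1.0, 0.0, 0.0]
y = [0.0, 1.0, 0.0]

for gamma in (0.04, 0.4, 4.0):
    print(f"gamma={gamma:<5} K(x, y)={rbf_kernel(x, y, gamma):.4f}")
```

A small gamma makes distant points still look similar and gives smoother decision boundaries. A large gamma restricts each training point's influence to its immediate neighbourhood, which can lead to overfitting.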





#### D. According to you, which performance metric should be given most importance, why?

The F1 score is the most meaningful metric here because it balances precision and recall, which matters when the classes are as imbalanced as the topics in this dataset.