BE A 24 Giwil Gidwani

## Experiment 4
Consider a suitable text dataset. Remove stop words, then apply stemming and feature selection techniques to represent the documents as vectors. Classify the documents and evaluate precision and recall.

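In this experiment, we use the 20 Newsgroups dataset, a collection of posts drawn from 20 newsgroups, and build a text classifier on top of a standard NLP preprocessing pipeline (tokenizing, stop-word removal, stemming, and TF-IDF vectorization).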
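
Since the experiment is evaluated on precision and recall, it may help to pin down what those numbers mean per class. The following is a minimal pure-Python sketch, not the routine used later in the notebook; the function name `precision_recall` and the toy labels are illustrative assumptions.

```python
from collections import Counter

def precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed one class at a time."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1          # correctly assigned to its class
        else:
            fp[guess] += 1          # predicted class received a wrong document
            fn[truth] += 1          # true class missed one of its documents
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]   # everything predicted as this label
        actual = tp[label] + fn[label]      # everything that truly has this label
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores

# Toy labels (hypothetical, not taken from the dataset split used below)
y_true = ["rec.autos", "rec.autos", "sci.space", "sci.space", "sci.med"]
y_pred = ["rec.autos", "sci.space", "sci.space", "sci.space", "sci.med"]
scores = precision_recall(y_true, y_pred)
```

In this toy case, `rec.autos` has perfect precision but only half of its documents are recovered, while `sci.space` recovers all of its documents at the cost of one false positive.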

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups

In [2]:
#fetching data
train_data = fetch_20newsgroups(remove=('headers','footers','quotes'))
X_df = pd.DataFrame(train_data.data,columns=['X'])
y_df = pd.DataFrame(train_data.target,columns=['y'])
y_df['label'] = y_df['y'].apply(lambda x:train_data.target_names[x])

In [3]:
#labels for each of the newsgroups
train_data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
X_df.head()

   X
0  I was wondering if anyone out there could enli...
1  A fair number of brave souls who upgraded thei...
2  well folks, my mac plus finally gave up the gh...
3  \nDo you have Weitek's address/phone number? ...
4  From article <C5owCB.n3p@world.std.com>, by to...


In [5]:
y_df.head()

    y  label
0   7  rec.autos
1   4  comp.sys.mac.hardware
2   4  comp.sys.mac.hardware
3   1  comp.graphics
4  14  sci.space


In [6]:
X_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11314 entries, 0 to 11313
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   X       11314 non-null  object
dtypes: object(1)
memory usage: 88.5+ KB


In [7]:
y_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11314 entries, 0 to 11313
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   y       11314 non-null  int64 
 1   label   11314 non-null  object
dtypes: int64(1), object(1)
memory usage: 176.9+ KB


In [8]:
#sample data
X_df.iloc[1]['X']

"A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks."

In [9]:
#sample class
y_df.iloc[1]

y                            4
label    comp.sys.mac.hardware
Name: 1, dtype: object

In [10]:
#tokenizing 
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

X_df['X'] = X_df['X'].apply(lambda x: word_tokenize(x.lower()))
X_df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


   X
0  [i, was, wondering, if, anyone, out, there, co...
1  [a, fair, number, of, brave, souls, who, upgra...
2  [well, folks, ,, my, mac, plus, finally, gave,...
3  [do, you, have, weitek, 's, address/phone, num...
4  [from, article, <, c5owcb.n3p, @, world.std.co...


In [11]:
#filtering stopwords
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

X_df['X'] = X_df['X'].apply(lambda x: [w for w in x if w not in stop_words])

X_df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


   X
0  [wondering, anyone, could, enlighten, car, saw...
1  [fair, number, brave, souls, upgraded, si, clo...
2  [well, folks, ,, mac, plus, finally, gave, gho...
3  [weitek, 's, address/phone, number, ?, 'd, lik...
4  [article, <, c5owcb.n3p, @, world.std.com, >, ...


In [12]:
#stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
X_df['X'] = X_df['X'].apply(lambda x: [stemmer.stem(w) for w in x])
X_df.head()

   X
0  [wonder, anyon, could, enlighten, car, saw, da...
1  [fair, number, brave, soul, upgrad, si, clock,...
2  [well, folk, ,, mac, plu, final, gave, ghost, ...
3  [weitek, 's, address/phon, number, ?, 'd, like...
4  [articl, <, c5owcb.n3p, @, world.std.com, >, ,...


In [13]:
#removing short words/characters (tokens of 2 characters or fewer, which also drops punctuation)
X_df['X'] = X_df['X'].apply(lambda x: [word for word in x if len(word)>2])

#removing short entries (fewer than 10 tokens)
#one boolean mask filters X_df and y_df together, avoiding chained assignment
keep = X_df['X'].apply(len) >= 10
X_df = X_df[keep].copy()
y_df = y_df[keep].copy()

#concat
X_df['X'] = X_df['X'].apply(lambda x: ' '.join(x))


In [14]:
#splitting dataset
y = y_df.drop('label',axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.1, random_state=69)

In [15]:
#converting words to vectors
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(X_train['X'].values)
test_vectors = vectorizer.transform(X_test['X'].values)
print(train_vectors.shape, test_vectors.shape)

(9211, 82021) (1024, 82021)
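
To make the weighting step less opaque, here is a rough pure-Python sketch of how TF-IDF weights can be computed, following the smoothed formula that `TfidfVectorizer` uses with its default settings: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. It also includes a simple document-frequency cutoff as one possible feature selection step. The helper name `tfidf_vectors`, the `min_df` threshold values, the whitespace tokenization, and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=1):
    """Return (vocabulary, vectors) for documents already joined by spaces."""
    tokenized = [doc.split() for doc in docs]   # assumption: plain whitespace tokens
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Simple feature selection: keep terms found in at least min_df documents.
    vocabulary = sorted(term for term, count in df.items() if count >= min_df)

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocabulary]      # raw count times idf
        norm = math.sqrt(sum(w * w for w in row))
        vectors.append([w / norm if norm else 0.0 for w in row])  # L2 normalize
    return vocabulary, vectors

# Toy stemmed documents (hypothetical)
docs = ["clock upgrad mac", "mac ghost", "space shuttl launch space"]
vocab, vecs = tfidf_vectors(docs, min_df=1)
pruned_vocab, _ = tfidf_vectors(docs, min_df=2)
```

With `min_df=2` only terms shared by at least two documents survive, which is one inexpensive way to shrink a vocabulary as large as the 82,021 columns reported above.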


In [16]:
#training an svm model

from sklearn import svm

clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
#ravel() flattens the single-column label frame into the 1-D array that fit() expects
clf.fit(train_vectors, y_train.values.ravel())


SVC(kernel='linear')

In [17]:
from sklearn.metrics import classification_report

predicted = clf.predict(test_vectors)

print(classification_report(y_test,predicted))


              precision    recall  f1-score   support

         0.0       0.68      0.72      0.70        39
         1.0       0.75      0.85      0.80        53
         2.0       0.72      0.79      0.75        48
         3.0       0.62      0.80      0.70        44
         4.0       0.93      0.70      0.80        54
         5.0       0.88      0.82      0.85        65
         6.0       0.77      0.81      0.79        53
         7.0       0.79      0.81      0.80        57
         8.0       0.90      0.76      0.83        59
         9.0       0.98      0.87      0.92        63
        10.0       0.95      0.97      0.96        60
        11.0       0.98      0.76      0.85        58
        12.0       0.71      0.74      0.72        62
        13.0       0.82      0.96      0.89        53
        14.0       0.76      0.86      0.81        49
        15.0       0.78      0.85      0.81        54
        16.0       0.78      0.77      0.77        47
        17.0       0.91    