# Lab 7

## Text Based Analysis

## Stopwords

Stopwords are words that are so widely used that they carry very little useful information. 

Ex: "a", "are", "the", "is"

Domain-specific stopwords can also be used.

The main reason to exclude these words is to save memory and processing time.
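A minimal sketch of stopword filtering, assuming a tiny hand-picked stopword set and plain whitespace tokenization. The names `STOPWORDS` and `remove_stopwords` are hypothetical and exist only for this illustration.

```python
# Hypothetical, hand-picked stopword set (for illustration only)
STOPWORDS = {"a", "are", "the", "is"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


tokens = remove_stopwords("The cats are on a mat")
print(tokens)  # ['cats', 'on', 'mat']
```

Half of the six input tokens are dropped here, which is where the memory and processing savings come from.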

## Stemming

Stemming is the process of reducing a word to its stem by stripping affixes (prefixes and suffixes). A related technique, lemmatization, reduces a word to its dictionary form, known as a "lemma".

Ex: "Dancing", "Danced", "Dancer", "Dances" can all be reduced to the root "Dance" (a stemmer may output a truncated form such as "danc").
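The idea can be sketched with a toy suffix stripper. This is not the Porter algorithm or any library's implementation. The suffix list and the `toy_stem` helper are assumptions made up for this example, and real stemmers apply far more careful rules.

```python
# Hypothetical suffix list, checked in this order (illustration only)
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["Dancing", "Danced", "Dancer", "Dances", "Dance"]
print({w: toy_stem(w) for w in words})  # every word maps to 'danc'
```

The stem "danc" is not a dictionary word, which is typical of stemming. Lemmatization would return "dance" instead.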

## TF-IDF

The basic idea behind TF-IDF (term frequency-inverse document frequency) is to weight words based on how often they appear in a document, and how rare they are across all documents in a collection.

This scheme gives a higher weight to words that appear often in a document but in few other documents, and a lower weight to words that are rare in the document or common across the collection.
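A small hand-rolled sketch of the weighting, assuming the textbook definitions tf = count / document length and idf = ln(N / document frequency). Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant with normalization by default, so their numbers will differ. The `tf_idf` function and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, made up for illustration
docs = [
    "the assembly debates peace".split(),
    "the assembly debates trade".split(),
    "the council debates peace".split(),
]
weights = tf_idf(docs)
print(weights[1])
```

In the second document, "the" and "debates" appear in every document and get a weight of 0, "assembly" appears in two of three and gets a small weight, and "trade" appears only here and gets the largest weight.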

## Cosine Similarity

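The cosine similarity between two vectors is the cosine of the angle between them, and can range from -1 to 1. TF-IDF vectors have no negative components, so for them the value falls between 0 and 1.

A value of 1 indicates that the vectors point in the same direction (they do not need to have the same length).

A value of 0 indicates that the vectors are orthogonal (unrelated).

A value of -1 indicates that the vectors point in opposite directions.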
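The three cases can be checked with a short pure-Python sketch. The `cosine_similarity` helper below is written for this example (it is not a library function) and assumes two equal-length, non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([3, 4], [6, 8]))   # same direction, different length
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

The first pair shows why a value of 1 does not require identical vectors: `[6, 8]` is twice as long as `[3, 4]` but points the same way.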

## Sentiment Analysis

A common approach is to train a Naive Bayes classifier, which learns how strongly each word is associated with positive or negative text.

This has a fair amount of drawbacks. The main one is that Naive Bayes treats words as independent of one another, whereas in practice a word's meaning depends on the context around it. Even so, it does a good enough job most of the time.
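The independence assumption can be seen in a tiny multinomial Naive Bayes sketch with add-one smoothing. The four training sentences, the labels, and the helper names `train_naive_bayes` and `predict` are all made up for this illustration. It is not how any particular sentiment library works.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs is a list of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-probability score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data, for illustration only
training = [
    ("great progress and peace".split(), "pos"),
    ("a hopeful and great future".split(), "pos"),
    ("terrible war and crisis".split(), "neg"),
    ("a grave crisis and terrible loss".split(), "neg"),
]
model = train_naive_bayes(training)
print(predict(model, "great peace".split()))
print(predict(model, "terrible crisis".split()))
print(predict(model, "not great".split()))
```

The last call is the interesting one. Each word is scored on its own, so "not great" is still classified as positive because of "great", which is the context problem described above.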

## Example

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

# suppress warnings
import warnings
warnings.filterwarnings('ignore')


In [20]:
# Load some data from UN debates
df = pd.read_csv('https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/un-general-debates/un-general-debates-blueprint.csv.gz')
df.head()

Unnamed: 0,session,year,country,country_name,speaker,position,text
0,25,1970,ALB,Albania,Mr. NAS,,33: May I first convey to our President the co...
1,25,1970,ARG,Argentina,Mr. DE PABLO PARDO,,177.\t : It is a fortunate coincidence that pr...
2,25,1970,AUS,Australia,Mr. McMAHON,,100.\t It is a pleasure for me to extend to y...
3,25,1970,AUT,Austria,Mr. KIRCHSCHLAEGER,,155.\t May I begin by expressing to Ambassado...
4,25,1970,BEL,Belgium,Mr. HARMEL,,"176. No doubt each of us, before coming up to ..."


In [21]:
# How many speeches from the United Kingdom?
df[df['country'] == 'GBR'].shape

(46, 7)

In [22]:
# How many speeches from Canada?
df[df['country'] == 'CAN'].shape

(46, 7)

In [23]:
# Download a set of common stop words and then add and remove a few extra for our problem
nltk.download('stopwords')

stopwords = set(nltk.corpus.stopwords.words('english'))

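# Note the comma after 'provinces': without it Python silently joins
# 'provinces' and 'united' into the single string 'provincesunited'
include_stopwords = {'dear', 'regards', 'must', 'would', 'also',
                     'canada', 'canadian', 'canadians', 'prime', 'minister', 'province', 'provinces',
                     'united', 'kingdom', 'state', 'british', 'irish', 'england', 'scotland', 'ireland', 'northern'}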
exclude_stopwords = {'against'}

stopwords |= include_stopwords
stopwords -= exclude_stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\naumh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
# build a text processing and classifier pipeline
# to predict the country (GBR (UK) or Canada) of a speech

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

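# Keep only the UK and Canada speeches; .copy() gives an independent DataFrame
# so that columns can be added later without a SettingWithCopyWarning
df2 = df[df['country'].isin(['GBR', 'CAN'])].copy()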

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df2['text'], df2['country'], test_size=0.2)

# Create a pipeline that first transforms the text data into TF-IDF vectors, then applies SVM
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=list(stopwords))),
    ('clf', svm.SVC()),
])

# Train the classifier
text_clf.fit(X_train, y_train)

# Predict the test set results
y_pred = text_clf.predict(X_test)

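# Print the classification report
# (scikit-learn sorts the class labels, so target_names must follow
# the sorted order: 'CAN' first, then 'GBR')
print(classification_report(y_test, y_pred, target_names=['CAN', 'GBR']))


              precision    recall  f1-score   support

         CAN       1.00      1.00      1.00         9
         GBR       1.00      1.00      1.00        10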

    accuracy                           1.00        19
   macro avg       1.00      1.00      1.00        19
weighted avg       1.00      1.00      1.00        19



In [25]:
# This script creates a new column 'sentiment' in the dataframe, 
# which contains the sentiment score of the text. 
# The sentiment score is a float within the range [-1.0, 1.0], 
# where -1.0 denotes a very negative sentiment, 
# 1.0 denotes a very positive sentiment, 
# and values around 0 denote a neutral sentiment.

from textblob import TextBlob

# Define a function to apply sentiment analysis to a text
def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity  # returns a value between -1 and 1

# Create a new column 'sentiment' in the DataFrame
df2['sentiment'] = df2['text'].apply(get_sentiment)

# Display the first five rows, sorted by sentiment
df2.head().sort_values(by='sentiment')

Unnamed: 0,session,year,country,country_name,speaker,position,text,sentiment
18,25,1970,GBR,United Kingdom,Sir Alec DOUGLASHOME,,"110.\t Mr. President, I should like first to s...",0.049427
8,25,1970,CAN,Canada,Mr. SHARP,,\nThe General Assembly is fortunate indeed to ...,0.107194
105,26,1971,GBR,United Kingdom,Sir Alec DOUGLAS-HOME,,"79.\tMr. President, I should like in the begin...",0.110666
204,27,1972,CAN,Canada,Mr. Sharp,,"Mr. President, the Canadian delegation looks f...",0.117961
84,26,1971,CAN,Canada,Mr- Sharp,,"48.\t May I first offer you, Sir, the fiM sup...",0.1368


In [26]:
# find average sentiment for each country in df2
df2.groupby('country')['sentiment'].mean()

country
CAN    0.112540
GBR    0.107741
Name: sentiment, dtype: float64

In [27]:
# find average sentiment for each speaker in df2
# order the results from most positive to most negative

df2.groupby('speaker')['sentiment'].mean().sort_values(ascending=False)

speaker
Malcolm Rifkind             0.154614
Lawrence Cannon             0.152515
Mr. Philip Hammond          0.150041
Leonard Edwards             0.145419
Mr. MACGUIGAN               0.144525
                              ...   
Pierre Stewart Pettigrew    0.079461
Carrington                  0.078381
Paul Martin                 0.075369
MacEACHEN                   0.074013
Sir Alec DOUGLASHOME        0.049427
Name: sentiment, Length: 67, dtype: float64

In [28]:
# Can do it by year as well
df2.groupby('year')['sentiment'].mean().sort_values(ascending=False)

year
2010    0.139200
1989    0.138822
2009    0.135793
1998    0.135411
1990    0.132206
1991    0.131901
1993    0.131544
2015    0.128994
1995    0.128064
1976    0.127838
1994    0.127015
2003    0.126351
1971    0.123733
1996    0.122056
1981    0.121182
1997    0.119208
1988    0.118922
1974    0.117654
1973    0.115815
2007    0.115423
2008    0.113624
1972    0.112794
1984    0.111963
1992    0.111177
2006    0.108345
2014    0.106854
2004    0.106391
1977    0.104993
1978    0.104992
2011    0.104768
2013    0.103519
1983    0.103454
1975    0.100152
1979    0.099571
1985    0.098452
2001    0.098188
1987    0.095949
2000    0.095095
1986    0.094259
2012    0.093486
2005    0.088541
1982    0.079657
1970    0.078311
1980    0.078242
1999    0.076149
2002    0.060399
Name: sentiment, dtype: float64

## STUDENT SECTION

In [29]:
# Make a copy of df2 called df3 with deep=True 
df3 = df2.copy(deep=True)


# Create a new column called target which is True if the year is greater than or equal to 2000 and False otherwise
df3['target'] = df3['year'] >= 2000

# Split the dataset into training and test sets with a test_size 0.2
X_train, X_test, y_train, y_test = train_test_split(df3['text'], df3['target'], test_size=0.2)


# Create a pipeline that first transforms the text data into TF-IDF vectors, then applies SVM
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=list(stopwords))),
    ('clf', svm.SVC()),
])

# Train the classifier
text_clf.fit(X_train, y_train)

# Predict the test set results
y_pred = text_clf.predict(X_test)

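# Print the classification report
# (scikit-learn sorts the class labels, so target_names must follow
# the sorted order: False first, then True)
print(classification_report(y_test, y_pred, target_names=['False', 'True']))
df3

              precision    recall  f1-score   support

       False       0.85      1.00      0.92        11
        True       1.00      0.75      0.86         8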

    accuracy                           0.89        19
   macro avg       0.92      0.88      0.89        19
weighted avg       0.91      0.89      0.89        19



Unnamed: 0,session,year,country,country_name,speaker,position,text,sentiment,target
8,25,1970,CAN,Canada,Mr. SHARP,,\nThe General Assembly is fortunate indeed to ...,0.107194,False
18,25,1970,GBR,United Kingdom,Sir Alec DOUGLASHOME,,"110.\t Mr. President, I should like first to s...",0.049427,False
84,26,1971,CAN,Canada,Mr- Sharp,,"48.\t May I first offer you, Sir, the fiM sup...",0.136800,False
105,26,1971,GBR,United Kingdom,Sir Alec DOUGLAS-HOME,,"79.\tMr. President, I should like in the begin...",0.110666,False
204,27,1972,CAN,Canada,Mr. Sharp,,"Mr. President, the Canadian delegation looks f...",0.117961,False
...,...,...,...,...,...,...,...,...,...
6988,68,2013,GBR,United Kingdom,Nicholas Clegg,Deputy Prime Minister,"In my lifetime, the \nworld has been sliced up...",0.123690,True
7149,69,2014,CAN,Canada,Stephen Harper,Prime Minister,It is both \nan honour and a pleasure for me t...,0.151988,True
7181,69,2014,GBR,United Kingdom,David Cameron,Prime Minister,This year we \nface extraordinary tests of our...,0.061720,True
7343,70,2015,CAN,Canada,Mr. Daniel Jean,Deputy Minister Foreign Affairs,I am honoured to appear before the Assembly to...,0.107947,True


In [30]:
# find average sentiment for every year in your DataFrame
df3.groupby('year')['sentiment'].mean()

year
1970    0.078311
1971    0.123733
1972    0.112794
1973    0.115815
1974    0.117654
1975    0.100152
1976    0.127838
1977    0.104993
1978    0.104992
1979    0.099571
1980    0.078242
1981    0.121182
1982    0.079657
1983    0.103454
1984    0.111963
1985    0.098452
1986    0.094259
1987    0.095949
1988    0.118922
1989    0.138822
1990    0.132206
1991    0.131901
1992    0.111177
1993    0.131544
1994    0.127015
1995    0.128064
1996    0.122056
1997    0.119208
1998    0.135411
1999    0.076149
2000    0.095095
2001    0.098188
2002    0.060399
2003    0.126351
2004    0.106391
2005    0.088541
2006    0.108345
2007    0.115423
2008    0.113624
2009    0.135793
2010    0.139200
2011    0.104768
2012    0.093486
2013    0.103519
2014    0.106854
2015    0.128994
Name: sentiment, dtype: float64