
# Final Project: NLP

> It’s a little like a theater where someone yells 'Fire!'. **Glaberson 1987**


![image.png](attachment:image.png)


## 1. Preprocessing
##### exploratory analysis
##### Beautiful Soup
##### lowercase all
##### stopword removal
##### tokenization
##### stemming or lemmatization


## 2. Processing Measures
##### Bag of words: word frequency
##### Co-occurrence
##### Named Entity Recognition
##### Sentiment Analysis
##### Term Frequency Inverse Document Frequency (TF-IDF)

A term frequency-inverse document frequency (TF-IDF) matrix assigns each term a weighted score for how important it is to each document in a corpus. The score combines two attributes. The first is the term frequency (TF): how often the word appears within that document. Intuitively, if a word appears many times in a document, the document is probably about something related to that term. The second is the inverse document frequency (IDF), which is derived from the proportion of documents in which the word appears. If a word appears in all documents, its weight is reduced. Conversely, if a word appears in only a few documents, it is weighted highly for those documents.

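The formula for TF-IDF is: $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$, where $t$ is a term, $d$ is a document and $D$ is the corpus.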
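
To make the formula concrete, here is a minimal sketch on a toy corpus. The corpus and the helper names `tf`, `idf` and `tfidf` are made up for illustration, and the unsmoothed `log(N / df)` form of IDF is only one common variant (libraries often apply smoothing), so treat it as an assumption rather than the definition used later in this notebook.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data, not from Reuters).
corpus = [
    ["cocoa", "price", "rise", "cocoa"],
    ["bank", "fund", "rise"],
    ["bank", "loan", "fund", "fund"],
]

def tf(term, doc):
    """Relative frequency of `term` within one document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / df): N documents, df of them contain `term` (assumes df > 0)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    """Product of the two attributes, as in the formula above."""
    return tf(term, doc) * idf(term, docs)

# "cocoa" is frequent in the first document and absent elsewhere,
# while "rise" also appears in a second document, so it scores lower.
print(tfidf("cocoa", corpus[0], corpus))
print(tfidf("rise", corpus[0], corpus))
```

A term that appears in every document gets `idf = log(1) = 0` under this variant, which matches the intuition that such a word should carry no weight.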

## 3. Topic Models



In [1]:
import nltk
nltk.download('punkt')

import glob
import numpy as np
import matplotlib.pyplot as plt
import gensim
import pandas as pd
import math
import pickle
from bs4 import BeautifulSoup

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\faria\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


## Initial Exploratory Analysis

In [2]:
pwd

'C:\\Users\\faria\\OneDrive\\Desktop\\Final_Tiago'

In [3]:
ls

 Volume in drive C is Windows
 Volume Serial Number is 3CA7-BD92

 Directory of C:\Users\faria\OneDrive\Desktop\Final_Tiago

17/10/2022  10:47    <DIR>          .
16/10/2022  00:04    <DIR>          ..
17/10/2022  10:47    <DIR>          .ipynb_checkpoints
04/12/1996  19:27               186 all-exchanges-strings.lc.txt
04/12/1996  19:27               316 all-orgs-strings.lc.txt
04/12/1996  19:27             2 474 all-people-strings.lc.txt
04/12/1996  19:27             1 721 all-places-strings.lc.txt
04/12/1996  19:27             1 005 all-topics-strings.lc.txt
02/06/2022  19:38           600 686 Captura de ecrã 2022-06-02 193821.png
13/06/2022  13:26               393 Code_1.txt
23/06/2022  11:17        17 426 597 DATA_CSV
23/06/2022  11:04        17 426 597 DATA_CSV.txt
09/09/2022  13:15           565 292 Dudensing_Paystubs_BuckleyLaw.pdf
22/06/2022  18:56           137 418 EA.ipynp
10/12/1996  04:32           273ÿ802 feldman-cia-worldfactbook-data.txt
01/06/2022  15:45             1

## Need to extract the SGM files

In [4]:
# a quick method to filter out the sgm files and create a list
files = glob.glob("*.sgm")
print(files)

['reut2-000.sgm', 'reut2-001.sgm', 'reut2-002.sgm', 'reut2-003.sgm', 'reut2-004.sgm', 'reut2-005.sgm', 'reut2-006.sgm', 'reut2-007.sgm', 'reut2-008.sgm', 'reut2-009.sgm', 'reut2-010.sgm', 'reut2-011.sgm', 'reut2-012.sgm', 'reut2-013.sgm', 'reut2-014.sgm', 'reut2-015.sgm', 'reut2-016.sgm', 'reut2-017.sgm', 'reut2-018.sgm', 'reut2-019.sgm', 'reut2-020.sgm', 'reut2-021.sgm']


In [5]:
#curious about the number of files
len(files)

22

In [6]:
#although there are only 22 files, they contain 21,578 documents, some of which my colleagues have determined are duplicates
docs=[]
for file in files:  
    with open(file, encoding='cp1252') as text:
        article = text.read()
        soup = BeautifulSoup(article, 'html.parser')
        for a in soup.find_all('reuters'):
            docs.append(a)
            #print(docs) tried but computer crashed
len(docs)

21578

In [7]:
#extract the date string from each document
clean_dates = []
for doc in docs:
    if doc.date:
        clean_dates.append(doc.date.text)
print(clean_dates)

['26-FEB-1987 15:01:01.79', '26-FEB-1987 15:02:20.00', '26-FEB-1987 15:03:27.51', '26-FEB-1987 15:07:13.72', '26-FEB-1987 15:10:44.60', '26-FEB-1987 15:14:36.41', '26-FEB-1987 15:14:42.83', '26-FEB-1987 15:15:40.12', '26-FEB-1987 15:17:11.20', '26-FEB-1987 15:18:06.67', '26-FEB-1987 15:18:59.34', '26-FEB-1987 15:19:15.45', '26-FEB-1987 15:20:13.09', '26-FEB-1987 15:20:27.17', '26-FEB-1987 15:20:48.43', '26-FEB-1987 15:21:16.13', '26-FEB-1987 15:24:48.56', '26-FEB-1987 15:26:26.78', '26-FEB-1987 15:26:54.12', '26-FEB-1987 15:32:03.12', '26-FEB-1987 15:33:23.61', '26-FEB-1987 15:34:07.03', '26-FEB-1987 15:34:16.30', '26-FEB-1987 15:35:16.67', '26-FEB-1987 15:35:39.38', '26-FEB-1987 15:36:44.78', '26-FEB-1987 15:36:53.42', '26-FEB-1987 15:38:26.23', '26-FEB-1987 15:39:41.92', '26-FEB-1987 15:41:56.54', '26-FEB-1987 15:43:14.36', '26-FEB-1987 15:43:59.53', '26-FEB-1987 15:44:36.04', '26-FEB-1987 15:45:19.65', '26-FEB-1987 15:45:26.55', '26-FEB-1987 15:45:35.37', '26-FEB-1987 15:45:39.20', 

In [8]:
# create a dataframe
df_dates = pd.DataFrame({
    'Date': clean_dates
})
# display the dataframe
print(df_dates)

                          Date
0      26-FEB-1987 15:01:01.79
1      26-FEB-1987 15:02:20.00
2      26-FEB-1987 15:03:27.51
3      26-FEB-1987 15:07:13.72
4      26-FEB-1987 15:10:44.60
...                        ...
21573  19-OCT-1987 00:34:08.94
21574  19-OCT-1987 00:18:22.79
21575  19-OCT-1987 00:05:11.26
21576  19-OCT-1987 00:03:21.69
21577  19-OCT-1987 13:30:41.59

[21578 rows x 1 columns]


In [9]:
titles = []
for doc in docs:
    if doc.title:
        titles.append(doc.title.text)
print(titles)



In [10]:
#created a simple for loop to see all the titles and dates together
for i in docs: 
    info = i.title.text, i.date.text
    print(info)
len(info)

('BAHIA COCOA REVIEW', '26-FEB-1987 15:01:01.79')
('STANDARD OIL <SRD> TO FORM FINANCIAL UNIT', '26-FEB-1987 15:02:20.00')
('TEXAS COMMERCE BANCSHARES <TCB> FILES PLAN', '26-FEB-1987 15:03:27.51')
('TALKING POINT/BANKAMERICA <BAC> EQUITY OFFER', '26-FEB-1987 15:07:13.72')
('NATIONAL AVERAGE PRICES FOR FARMER-OWNED RESERVE', '26-FEB-1987 15:10:44.60')
('ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS', '26-FEB-1987 15:14:36.41')
('RED LION INNS FILES PLANS OFFERING', '26-FEB-1987 15:14:42.83')
("USX <X> DEBT DOWGRADED BY MOODY'S", '26-FEB-1987 15:15:40.12')
('CHAMPION PRODUCTS <CH> APPROVES STOCK SPLIT', '26-FEB-1987 15:17:11.20')
('COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE', '26-FEB-1987 15:18:06.67')
('COBANCO INC <CBCO> YEAR NET', '26-FEB-1987 15:18:59.34')
('OHIO MATTRESS <OMT> MAY HAVE LOWER 1ST QTR NET', '26-FEB-1987 15:19:15.45')
('AM INTERNATIONAL INC <AM> 2ND QTR JAN 31', '26-FEB-1987 15:20:13.09')
('BROWN-FORMAN INC <BFD> 4TH QTR NET', '26-FEB-1987 15:20:27.17')
('NATIONAL

AttributeError: 'NoneType' object has no attribute 'text'
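
The `AttributeError` above occurs because at least one document has no title element, so `i.title` is `None`. A minimal sketch of a guard follows. The helper name `tag_text` is hypothetical, and the `SimpleNamespace` objects are stand-ins for parsed documents with illustrative values; with the real data the same helper would be called on `doc.title` and `doc.date`.

```python
from types import SimpleNamespace

def tag_text(tag):
    """Return tag.text, or None when the tag is missing.

    BeautifulSoup returns None for a child element that does not exist,
    which is what triggers the AttributeError when .text is accessed.
    """
    return tag.text if tag is not None else None

# Stand-ins for parsed documents (hypothetical; the second one has no title).
sample_docs = [
    SimpleNamespace(title=SimpleNamespace(text="BAHIA COCOA REVIEW"),
                    date=SimpleNamespace(text="26-FEB-1987 15:01:01.79")),
    SimpleNamespace(title=None,
                    date=SimpleNamespace(text="26-FEB-1987 15:02:20.00")),
]

# Collect (title, date) pairs without crashing on the untitled document.
pairs = [(tag_text(d.title), tag_text(d.date)) for d in sample_docs]
print(pairs)
```

Keeping `None` for a missing title, rather than skipping the document, preserves the one-to-one alignment between titles and dates.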

### Goal is to create a dataframe and do some analysis on it
### Ultimate goal is to see what the financial climate was during 1987 and whether it correlated with the politics of that time

In [26]:
#needed a loop to extract the title, date and body of each document so I could create a df
titles = []
dates = []
texts = []


for doc in docs:
    #skip documents that have no body
    if doc.body is not None:
        titles.append(doc.title.text)
        dates.append(doc.date.text)
        texts.append(doc.body.text)

In [27]:
#df is created but further preprocessing still needs to be done
df = pd.DataFrame(list(zip(titles, dates, texts)), columns = ['Title','Date','Text'])
df.head()

Unnamed: 0,Title,Date,Text
0,BAHIA COCOA REVIEW,26-FEB-1987 15:01:01.79,Showers continued throughout the week in\nthe ...
1,STANDARD OIL <SRD> TO FORM FINANCIAL UNIT,26-FEB-1987 15:02:20.00,Standard Oil Co and BP North America\nInc said...
2,TEXAS COMMERCE BANCSHARES <TCB> FILES PLAN,26-FEB-1987 15:03:27.51,Texas Commerce Bancshares Inc's Texas\nCommerc...
3,TALKING POINT/BANKAMERICA <BAC> EQUITY OFFER,26-FEB-1987 15:07:13.72,BankAmerica Corp is not under\npressure to act...
4,NATIONAL AVERAGE PRICES FOR FARMER-OWNED RESERVE,26-FEB-1987 15:10:44.60,The U.S. Agriculture Department\nreported the ...


df['Date'] = pd.to_datetime(df['Date']) # If your Date column is of the type object otherwise skip this
date_range = str(df['Date'].dt.date.min()) + ' to ' +str(df['Date'].dt.date.max())
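
The date strings shown above follow a day, abbreviated month, year layout with fractional seconds. As a sketch, assuming every row matches the pattern of those samples and that month names are parsed under an English locale, the format can be spelled out explicitly instead of being inferred. The constant and function names are hypothetical.

```python
from datetime import datetime

# Format inferred from samples such as '26-FEB-1987 15:01:01.79' (an assumption).
REUTERS_DATE_FORMAT = "%d-%b-%Y %H:%M:%S.%f"

def parse_reuters_date(raw):
    """Parse one Reuters date string into a datetime object."""
    return datetime.strptime(raw.strip(), REUTERS_DATE_FORMAT)

# Dates of the first and last rows of the dataframe shown above.
start = parse_reuters_date("26-FEB-1987 15:01:01.79")
end = parse_reuters_date("19-OCT-1987 13:30:41.59")
print(start.date(), "to", end.date())
```

The same format string could be passed as the `format` argument of `pd.to_datetime`, which avoids ambiguity between day-first and month-first guesses.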

In [28]:
df['Text'][0]

'Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n    There are doubts as to how much of this cocoa would 

In [29]:
df.shape

(19043, 3)

In [30]:
df.tail()

Unnamed: 0,Title,Date,Text
19038,JAPAN/INDIA CONFERENCE CUTS GULF WAR RISK CHARGES,19-OCT-1987 00:34:08.94,The Japan/India-Pakistan-Gulf/Japan\nshipping ...
19039,SOVIET INDUSTRIAL GROWTH/TRADE SLOWER IN 1987,19-OCT-1987 00:18:22.79,The Soviet Union's industrial output is\ngrowi...
19040,SIX KILLED IN SOUTH AFRICAN GOLD MINE ACCIDENT,19-OCT-1987 00:05:11.26,Six black miners have been killed\nand two inj...
19041,PROJECTIONS SHOW SWISS VOTERS WANT TRIED PARTIES,19-OCT-1987 00:03:21.69,The prospect of a dominant alliance of\nsocial...
19042,AMERICAN EXCHANGE INTRODUCES INSTITUTIONAL INDEX,19-OCT-1987 13:30:41.59,The American Stock Exchange said it has\nintro...


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19043 entries, 0 to 19042
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Title   19043 non-null  object
 1   Date    19043 non-null  object
 2   Text    19043 non-null  object
dtypes: object(3)
memory usage: 446.4+ KB


In [32]:
df.describe()  #there are fewer unique Titles and Texts than rows, so some entries are duplicates. Consider removing them

Unnamed: 0,Title,Date,Text
count,19043,19043,19043
unique,18253,19043,18781
top,PROPOSED OFFERINGS RECENTLY FILED WITH THE SEC,26-FEB-1987 15:01:01.79,The Bundesbank left credit policies\nunchanged...
freq,25,1,4
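
The summary above reports 19043 rows but only 18781 unique `Text` values, so 262 rows repeat an earlier body. A minimal sketch of counting and dropping such rows follows. The toy frame is made up for illustration, and keeping the first occurrence is an assumption about which copy to retain.

```python
import pandas as pd

# Toy frame with the same column names as df (rows are made up).
toy = pd.DataFrame({
    "Title": ["A", "A", "B", "C"],
    "Date": ["d1", "d2", "d3", "d4"],
    "Text": ["same body", "same body", "other body", "third body"],
})

# Rows whose Text repeats an earlier row.
n_dupes = int(toy.duplicated(subset=["Text"]).sum())

# Keep the first occurrence of each Text and renumber the index.
deduped = toy.drop_duplicates(subset=["Text"], keep="first").reset_index(drop=True)

print(n_dupes, len(deduped))
```

Deduplicating on `Text` rather than `Title` is deliberate here: recurring headlines such as the most frequent title in the summary can legitimately head different articles.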


### Preprocessing the data


In [33]:
# will merge the individual text columns together into a single column
#df['all_text'] = df['Title'] + df['Text']
#commented out because the result has no space between the title and the text

In [34]:
#convert all words to lowercase since python is case sensitive
#df['all_text']= df['all_text'].str.lower()

In [35]:
df['Text']= df['Text'].str.lower()

In [36]:
df['Title']= df['Title'].str.lower()

In [37]:
df['Title'].head()

0                                  bahia cocoa review
1           standard oil <srd> to form financial unit
2          texas commerce bancshares <tcb> files plan
3        talking point/bankamerica <bac> equity offer
4    national average prices for farmer-owned reserve
Name: Title, dtype: object

In [38]:
df['Text'].head()

0    showers continued throughout the week in\nthe ...
1    standard oil co and bp north america\ninc said...
2    texas commerce bancshares inc's texas\ncommerc...
3    bankamerica corp is not under\npressure to act...
4    the u.s. agriculture department\nreported the ...
Name: Text, dtype: object

In [39]:
#check to see if all lowercase
#df['all_text']

In [40]:
df.head()

Unnamed: 0,Title,Date,Text
0,bahia cocoa review,26-FEB-1987 15:01:01.79,showers continued throughout the week in\nthe ...
1,standard oil <srd> to form financial unit,26-FEB-1987 15:02:20.00,standard oil co and bp north america\ninc said...
2,texas commerce bancshares <tcb> files plan,26-FEB-1987 15:03:27.51,texas commerce bancshares inc's texas\ncommerc...
3,talking point/bankamerica <bac> equity offer,26-FEB-1987 15:07:13.72,bankamerica corp is not under\npressure to act...
4,national average prices for farmer-owned reserve,26-FEB-1987 15:10:44.60,the u.s. agriculture department\nreported the ...


In [41]:
df['Text'][0]

'showers continued throughout the week in\nthe bahia cocoa zone, alleviating the drought since early\njanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\ncomissaria smith said in its weekly review.\n    the dry period means the temporao will be late this year.\n    arrivals for the week ended february 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    comissaria smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. with total bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n    there are doubts as to how much of this cocoa would 

In [42]:
df['Title'] =df['Title'].replace('\n', ' ', regex=True)

In [43]:
#checked results and it is fine
df['Title'][0]

'bahia cocoa review'

In [44]:
#I noticed that the newline character \n was still in the text, so it needs to be removed
df['Text'] =df['Text'].replace('\n', ' ', regex=True)

In [45]:
#checked results and it is fine
df['Text'][0]

'showers continued throughout the week in the bahia cocoa zone, alleviating the drought since early january and improving prospects for the coming temporao, although normal humidity levels have not been restored, comissaria smith said in its weekly review.     the dry period means the temporao will be late this year.     arrivals for the week ended february 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. again it seems that cocoa delivered earlier on consignment was included in the arrivals figures.     comissaria smith said there is still some doubt as to how much old crop cocoa is still available as harvesting has practically come to an end. with total bahia crop estimates around 6.4 mln bags and sales standing at almost 6.2 mln there are a few hundred thousand bags still in the hands of farmers, middlemen, exporters and processors.     there are doubts as to how much of this cocoa would be fit for export

In [46]:
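import string
punctuation = string.punctuation
#note: iterating over a string yields single characters, so this turns each text into a list of characters rather than words
df['Text'] = df['Text'].apply(lambda x: [word for word in x if word not in punctuation])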


In [47]:
df['Title'] = df['Title'].apply(lambda x: [word for word in x if word not in punctuation])

In [48]:
df["Text"].head()

0    [s, h, o, w, e, r, s,  , c, o, n, t, i, n, u, ...
1    [s, t, a, n, d, a, r, d,  , o, i, l,  , c, o, ...
2    [t, e, x, a, s,  , c, o, m, m, e, r, c, e,  , ...
3    [b, a, n, k, a, m, e, r, i, c, a,  , c, o, r, ...
4    [t, h, e,  , u, s,  , a, g, r, i, c, u, l, t, ...
Name: Text, dtype: object

In [49]:
df["Title"].head()

0    [b, a, h, i, a,  , c, o, c, o, a,  , r, e, v, ...
1    [s, t, a, n, d, a, r, d,  , o, i, l,  , s, r, ...
2    [t, e, x, a, s,  , c, o, m, m, e, r, c, e,  , ...
3    [t, a, l, k, i, n, g,  , p, o, i, n, t, b, a, ...
4    [n, a, t, i, o, n, a, l,  , a, v, e, r, a, g, ...
Name: Title, dtype: object
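
The outputs above contain single characters because a list comprehension over a string visits one character at a time. A sketch of an alternative that removes punctuation while keeping each entry a string is shown below. The helper name `strip_punctuation` and the sample sentence are illustrative.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return `text` with punctuation removed; the result is still a string."""
    return text.translate(_PUNCT_TABLE)

sample = "showers continued in the bahia cocoa zone, alleviating the drought."

# What the list comprehension produces: a list of characters.
as_chars = [ch for ch in sample if ch not in string.punctuation]

# What the translate-based helper produces: a cleaned string.
as_string = strip_punctuation(sample)

print(as_chars[:5])
print(as_string)
```

It could be applied with `df['Text'].apply(strip_punctuation)`. Note that deleting all punctuation also turns `5.93` into `593` and `u.s.` into `us`, which may not be desirable for financial text.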

In [50]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\faria\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [51]:
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize 

In [None]:
df['Text'].apply(word_tokenize)

In [None]:
df['Title'].apply(word_tokenize)

In [54]:
#this function will tokenize a pandas column (which here I merged) and return lists of tokens
#df['tokenized'] = df.apply(lambda x: tokenize(x['Text']), axis=1)



In [78]:
# now remove stopwords: they add little meaning, so removing them reduces noise
nltk.download('stopwords');

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\faria\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [79]:
import nltk
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')
print(sw_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [80]:
def remove_stopwords(tokenized_column):
    """Return a list of tokens with English stopwords removed.

    Args:
        tokenized_column: Pandas dataframe column of tokenized data from tokenize()

    Returns:
        tokens (list): Tokenized list with stopwords removed.

    """
    stops = set(stopwords.words("english"))
    return [word for word in tokenized_column if word not in stops]

In [81]:
import pandas as pd
from textblob import TextBlob
import numpy as np
import os
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\faria\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [82]:
df['Text'] = df['Text'].apply(lambda x: [item for item in x.split() if item not in stop])

In [83]:
df['Title'] = df['Title'].apply(lambda x: [item for item in x.split() if item not in stop])

In [84]:
df['Title'][0]

['bahia', 'cocoa', 'review']

In [85]:
df['Text'][0]

['showers',
 'continued',
 'throughout',
 'week',
 'bahia',
 'cocoa',
 'zone,',
 'alleviating',
 'drought',
 'since',
 'early',
 'january',
 'improving',
 'prospects',
 'coming',
 'temporao,',
 'although',
 'normal',
 'humidity',
 'levels',
 'restored,',
 'comissaria',
 'smith',
 'said',
 'weekly',
 'review.',
 'dry',
 'period',
 'means',
 'temporao',
 'late',
 'year.',
 'arrivals',
 'week',
 'ended',
 'february',
 '22',
 '155,221',
 'bags',
 '60',
 'kilos',
 'making',
 'cumulative',
 'total',
 'season',
 '5.93',
 'mln',
 '5.81',
 'stage',
 'last',
 'year.',
 'seems',
 'cocoa',
 'delivered',
 'earlier',
 'consignment',
 'included',
 'arrivals',
 'figures.',
 'comissaria',
 'smith',
 'said',
 'still',
 'doubt',
 'much',
 'old',
 'crop',
 'cocoa',
 'still',
 'available',
 'harvesting',
 'practically',
 'come',
 'end.',
 'total',
 'bahia',
 'crop',
 'estimates',
 'around',
 '6.4',
 'mln',
 'bags',
 'sales',
 'standing',
 'almost',
 '6.2',
 'mln',
 'hundred',
 'thousand',
 'bags',
 'still'
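
Tokens such as `'zone,'` and `'review.'` in the output above keep their trailing punctuation because `str.split()` only splits on whitespace. A sketch of one way to handle this is to strip punctuation from the edges of each token before filtering stopwords. The small stopword set and the function name `clean_tokens` are assumptions for illustration; the notebook itself uses NLTK's English list.

```python
import string

# Small illustrative stopword set (assumption; not the NLTK list).
STOPWORDS = {"the", "in", "and", "its", "said"}

def clean_tokens(text, stopwords=STOPWORDS):
    """Split on whitespace, trim punctuation at token edges, drop stopwords."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    # Drop tokens that became empty after trimming, then drop stopwords.
    return [tok for tok in tokens if tok and tok not in stopwords]

print(clean_tokens(
    "the bahia cocoa zone, comissaria smith said in its weekly review. 155,221 bags"
))
```

Trimming only the edges keeps internal punctuation, so figures like `155,221` and `5.93` survive intact.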

In [86]:
import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

In [87]:
#now I need to stem or lemmatize the tokens. I chose lemmatization because it is more accurate:
#stemming reduced the word financial to a stem that excluded it from my list of words related to money
lmtzr = WordNetLemmatizer()
df['Lem_text'] = df['Text'].apply(
                    lambda lst:[lmtzr.lemmatize(word) for word in lst])

In [88]:
df['Lem_title'] = df['Title'].apply(
                    lambda lst:[lmtzr.lemmatize(word) for word in lst])

In [89]:
#appears correct
df['Lem_text'][0]

['shower',
 'continued',
 'throughout',
 'week',
 'bahia',
 'cocoa',
 'zone,',
 'alleviating',
 'drought',
 'since',
 'early',
 'january',
 'improving',
 'prospect',
 'coming',
 'temporao,',
 'although',
 'normal',
 'humidity',
 'level',
 'restored,',
 'comissaria',
 'smith',
 'said',
 'weekly',
 'review.',
 'dry',
 'period',
 'mean',
 'temporao',
 'late',
 'year.',
 'arrival',
 'week',
 'ended',
 'february',
 '22',
 '155,221',
 'bag',
 '60',
 'kilo',
 'making',
 'cumulative',
 'total',
 'season',
 '5.93',
 'mln',
 '5.81',
 'stage',
 'last',
 'year.',
 'seems',
 'cocoa',
 'delivered',
 'earlier',
 'consignment',
 'included',
 'arrival',
 'figures.',
 'comissaria',
 'smith',
 'said',
 'still',
 'doubt',
 'much',
 'old',
 'crop',
 'cocoa',
 'still',
 'available',
 'harvesting',
 'practically',
 'come',
 'end.',
 'total',
 'bahia',
 'crop',
 'estimate',
 'around',
 '6.4',
 'mln',
 'bag',
 'sale',
 'standing',
 'almost',
 '6.2',
 'mln',
 'hundred',
 'thousand',
 'bag',
 'still',
 'hand',
 

In [90]:
data = df[['Lem_title','Date', 'Lem_text']]

In [91]:
data.head()

Unnamed: 0,Lem_title,Date,Lem_text
0,"[bahia, cocoa, review]",26-FEB-1987 15:01:01.79,"[shower, continued, throughout, week, bahia, c..."
1,"[standard, oil, <srd>, form, financial, unit]",26-FEB-1987 15:02:20.00,"[standard, oil, co, bp, north, america, inc, s..."
2,"[texas, commerce, bancshares, <tcb>, file, plan]",26-FEB-1987 15:03:27.51,"[texas, commerce, bancshares, inc's, texas, co..."
3,"[talking, point/bankamerica, <bac>, equity, of...",26-FEB-1987 15:07:13.72,"[bankamerica, corp, pressure, act, quickly, pr..."
4,"[national, average, price, farmer-owned, reserve]",26-FEB-1987 15:10:44.60,"[u.s., agriculture, department, reported, farm..."


df_money.head()

filtered_df =df[df['lemmatize_text'].str.contains('money', regex=False)]

In [163]:
#initially I started with the word money, but after applying synsets (see below) I realized that fund was the most inclusive
#filt_df = data[data['Lem_title'].str.contains('money', regex=False)]

In [164]:
filt_df.shape  # I want to first see the titles with money

(224, 3)

In [172]:
filt_df = data[data['Lem_title'].str.contains('fund', regex=False)]

In [173]:
filt_df.shape  #more titles contain the word money; here I ran it with fund

(106, 3)

In [174]:
filt_df = data[data['Lem_title'].str.contains('fund', regex=False)]

In [175]:
filt_df = data[data['Lem_text'].str.contains('fund', regex=False)]

In [176]:
filt_df.head()

Unnamed: 0,Lem_title,Date,Lem_text
47,"[america, first, mortgage, set, special, payout]",26-FEB-1987 15:52:33.04,"[<america, first, federally, guaranteed, mortg..."
55,"[asset, u.s., money, fund, rose, week]",26-FEB-1987 15:58:19.46,"[asset, money, market, mutual, fund, increased..."
78,"[columbia, gas, system, inc, <cg>, redeems, de...",26-FEB-1987 16:25:42.65,"[columbia, gas, system, inc, said, redeem, 4.7..."
96,"[u.s., bank, discount, borrowing, 310, mln, dlrs]",26-FEB-1987 16:41:34.44,"[u.s., bank, discount, window, borrowing, le, ..."
100,"[liberty, all-star, <usa>, set, initial, payout]",26-FEB-1987 16:45:44.50,"[liberty, all-star, equity, fund, said, declar..."


In [177]:
filt_df.shape  #by changing the word from money to fund, I increased the rows by over 300

(1028, 3)

In [179]:
filt_df['Lem_text'][96]  #testing

['u.s.',
 'bank',
 'discount',
 'window',
 'borrowing',
 'le',
 'extended',
 'credit',
 'averaged',
 '310',
 'mln',
 'dlrs',
 'week',
 'wednesday',
 'february',
 '25,',
 'federal',
 'reserve',
 'said.',
 'fed',
 'said',
 'overall',
 'borrowing',
 'week',
 'fell',
 '131',
 'mln',
 'dlrs',
 '614',
 'mln',
 'dlrs,',
 'extended',
 'credit',
 '10',
 'mln',
 'dlrs',
 '304',
 'mln',
 'dlrs.',
 'week',
 'second',
 'half',
 'two-week',
 'statement',
 'period.',
 'net',
 'borrowing',
 'prior',
 'week',
 'averaged',
 '451',
 'mln',
 'dlrs.',
 'commenting',
 'two-week',
 'statement',
 'period',
 'ended',
 'february',
 '25,',
 'fed',
 'said',
 'bank',
 'average',
 'net',
 'free',
 'reserve',
 '644',
 'mln',
 'dlrs',
 'day,',
 '1.34',
 'billion',
 'two',
 'week',
 'earlier.',
 'federal',
 'reserve',
 'spokesman',
 'told',
 'press',
 'briefing',
 'large',
 'single',
 'day',
 'net',
 'miss',
 "fed's",
 'reserve',
 'projection',
 'week',
 'wednesday.',
 'said',
 'natural',
 'float',
 '"acting',
 'bit',

In [180]:
filt_df['Lem_title'][96]  #the word fund is not in this title; the row matched because the last filter (In [175]) was applied to Lem_text

['u.s.', 'bank', 'discount', 'borrowing', '310', 'mln', 'dlrs']
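
Because `Lem_title` and `Lem_text` hold lists of tokens rather than strings, `.str.contains('fund', regex=False)` appears to apply Python's `in` operator to each list. That is an assumption about how pandas treats list-valued cells and is worth verifying on the installed version. Under that reading the filter is an exact token match, as this sketch with a made-up token list illustrates.

```python
# Hypothetical tokens: none is exactly 'fund', but two contain it as a substring.
tokens = ["asset", "money", "market", "mutual", "funds,", "refunding"]

# List membership: True only if some token equals 'fund' exactly.
exact = "fund" in tokens

# Substring search inside each token: also catches 'funds,' and 'refunding'.
substring = any("fund" in tok for tok in tokens)

print(exact, substring)
```

If exact matching is the intent, writing it explicitly as `data['Lem_text'].apply(lambda toks: 'fund' in toks)` makes that visible. Tokens with attached punctuation, such as `'fund,'`, would be missed either way.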

In [181]:
filt_df2 = data[data['Lem_title'].str.contains('finance', regex=False)] 

In [182]:
filt_df2 = data[data['Lem_text'].str.contains('finance', regex=False)]

In [183]:
filt_df2.head()

Unnamed: 0,Lem_title,Date,Lem_text
6,"[red, lion, inn, file, plan, offering]",26-FEB-1987 15:14:42.83,"[red, lion, inn, limited, partnership, said, f..."
25,"[banker, report, breakthrough, venezuelan, debt]",26-FEB-1987 15:36:44.78,"[venezuela, bank, advisory, committee, agreed,..."
58,"[key, u.s., tax, writer, seek, estate, tax, curb]",26-FEB-1987 16:04:05.90,"[chairman, senior, republican, member, house, ..."
60,"[canada's, wilson, seek, temporary, borrowing]",26-FEB-1987 16:05:36.23,"[canadian, finance, minister, michael, wilson,..."
121,"[venezuela, seek, 'flexibility', banks-azpurua]",26-FEB-1987 17:11:26.97,"[venezuela, seeking, 'constructive, flexible',..."


In [185]:
filt_df2.shape  #quite a lot of rows contain finance, but oddly there was little crossover with the rows containing money


(1239, 3)

In [186]:
#filt_finance =(filt_df[filt_df['Lem_text'].str.contains('finance', regex=False)])

In [103]:
#Filt_finance =(filt_df[filt_df['Lem_title'].str.contains('finance', regex=False)])

In [104]:
#filt_finance.head()

Unnamed: 0,Lem_title,Date,Lem_text
337,"[funaro, say, brazil, need, more,, faster, fin...",2-MAR-1987 08:15:10.94,"[brazil, would, suspended, payment, debt, owed..."
4879,"[u.s., corporate, finance, -, bank, paper, pre...",16-MAR-1987 09:02:32.84,"[debt, security, issued, major, u.s., bank, pr..."
5917,"[french, finance, group, sdr, issue, domestic,...",18-MAR-1987 09:17:31.23,"[french, regional, financing, group, <societes..."
15258,"[ec, minister, likely, criticise, finance, idea]",26-APR-1987 06:44:58.79,"[plan, new-style, european, community, (ec), f..."


In [105]:
#filt_finance.shape  #this tells me that very few articles regarding money have to do with the subject of finance, which seems strange

(4, 3)

In [187]:
filt_finance =(filt_df[filt_df['Lem_text'].str.contains('finance', regex=False)])

In [188]:
filt_finance =(filt_df[filt_df['Lem_title'].str.contains('finance', regex=False)])

In [189]:
filt_finance.shape  #more than twice as many

(9, 3)

In [106]:
filt_fund =(filt_df2[filt_df2['Lem_text'].str.contains('fund', regex=False)])

In [190]:
filt_fund =(filt_df2[filt_df2['Lem_text'].str.contains('fund', regex=False)])

In [191]:
filt_money.head()

Unnamed: 0,Lem_title,Date,Lem_text
164,"[u.s., treasury, part, argentine, bridge, loan]",26-FEB-1987 18:07:18.31,"[u.s., treasury, said, willing, participate, s..."
328,"[manila, official, split, debt, strategy]",2-MAR-1987 07:39:34.16,"[rift, occured, among, philippine, official, d..."
337,"[funaro, say, brazil, need, more,, faster, fin...",2-MAR-1987 08:15:10.94,"[brazil, would, suspended, payment, debt, owed..."
824,"[japan's, unemployment, rate, seen, rising, 3....",3-MAR-1987 01:39:16.42,"[japan's, unemployment, rate, expected, contin..."
847,"[japan, likely, let, u.s., bank, deal, security]",3-MAR-1987 05:05:45.88,"[japan, look, likely, allow, u.s., bank, condu..."


In [109]:
#filt_money.shape  # this tells me that there exist articles that are about finance and contain the lemma money

(130, 3)

In [192]:
filt_fund.shape  #also more than double, so fund is the better word for filtering articles about money

(244, 3)

### SYNSET

In [195]:
from nltk.corpus import wordnet as wn
for ss in wn.synsets('money'):
    print(ss, ss.hypernyms())


Synset('money.n.01') [Synset('medium_of_exchange.n.01')]
Synset('money.n.02') [Synset('wealth.n.04')]
Synset('money.n.03') [Synset('currency.n.01')]


In [196]:
from nltk.corpus import wordnet as wn
for ss in wn.synsets('fund'):
    print(ss, ss.hypernyms())

Synset('fund.n.01') [Synset('money.n.01')]
Synset('store.n.02') [Synset('accumulation.n.04')]
Synset('investment_company.n.01') [Synset('nondepository_financial_institution.n.01')]
Synset('fund.v.01') [Synset('finance.v.01')]
Synset('fund.v.02') [Synset('roll_up.v.02')]
Synset('fund.v.03') [Synset('supply.v.01')]
Synset('fund.v.04') [Synset('invest.v.01')]
Synset('fund.v.05') [Synset('roll_up.v.02')]
Synset('fund.v.06') [Synset('support.v.02')]


In [197]:
print(wn.synset('money.n.01').definition())

the most common medium of exchange; functions as legal tender


In [198]:
print(wn.synset('fund.n.01').definition())

a reserve of money set aside for some purpose


In [199]:
print(wn.synset('finance.n.01').definition())

the commercial activity of providing funds and capital


In [200]:
wn.synset('money.n.01').lemmas()

[Lemma('money.n.01.money')]

In [202]:
wn.synset('fund.n.01').lemmas()

[Lemma('fund.n.01.fund'), Lemma('fund.n.01.monetary_fund')]

In [203]:
[str(lemma.name()) for lemma in wn.synset('funds.n.01').lemmas()]

['funds',
 'finances',
 'monetary_resource',
 'cash_in_hand',
 'pecuniary_resource']

In [205]:
# funds has more lemmas, so I will change money to fund
[str(lemma.name()) for lemma in wn.synset('money.n.01').lemmas()]  #I can see that funds is much more inclusive

['money']

In [206]:
wn.synset('finance.n.01').lemmas()

[Lemma('finance.n.01.finance')]

In [207]:
import nltk
from nltk.corpus import wordnet     #Import wordnet from the NLTK
first_word = wordnet.synset("money.n.01")
second_word = wordnet.synset("finance.v.01")
print('Similarity to finance as a verb: ' + str(first_word.wup_similarity(second_word)))
first_word = wordnet.synset("money.n.02")
second_word = wordnet.synset("finance.n.01")
print('Similarity to finance as a noun: ' + str(first_word.wup_similarity(second_word)))


Similarity to finance as a verb: 0.15384615384615385
Similarity to finance as a noun: 0.25


In [212]:
first_word = wordnet.synset("fund.n.01")
second_word = wordnet.synset("finance.v.01")
print('Similarity to finance as a verb: ' + str(first_word.wup_similarity(second_word)))
first_word = wordnet.synset("fund.n.02")
second_word = wordnet.synset("finance.n.01")
print('Similarity to finance as a noun: ' + str(first_word.wup_similarity(second_word)))

Similarity to finance as a verb: 0.14285714285714285
Similarity to finance as a noun: 0.19047619047619047


In [215]:
import nltk
from nltk.corpus import wordnet     #Import wordnet from the NLTK
first_word = wordnet.synset("fund.n.01")
second_word = wordnet.synset("money.n.01")
print('Similarity between fund and money: ' + str(first_word.wup_similarity(second_word)))


Similarity between fund and money: 0.9333333333333333


## Hypernyms and Hyponyms

Hypernyms: more abstract terms.
Hyponyms: more specific terms.

Both come into the picture because synsets are organized in a structure similar to an inheritance tree. The tree can be traced all the way up to a root hypernym. Hypernyms provide a way to categorize and group words based on their similarity to each other.

In [216]:
from nltk.corpus import wordnet
syn = wordnet.synsets('money')[0]

print("Synset name :  ", syn.name())

print("\nSynset abstract term :  ", syn.hypernyms())

#hyponyms of the hypernym, i.e. money and its sibling terms
print("\nSynset specific term :  ",
      syn.hypernyms()[0].hyponyms())

print("\nSynset root hypernym :  ", syn.root_hypernyms())

Synset name :   money.n.01

Synset abstract term :   [Synset('medium_of_exchange.n.01')]

Synset specific term :   [Synset('currency.n.01'), Synset('money.n.01'), Synset('tender.n.01')]

Synset root hypernym :   [Synset('entity.n.01')]


In [218]:
#confirm using wup

first_word.wup_similarity(second_word)

0.9333333333333333

In [219]:
sorted(first_word.common_hypernyms(second_word))

[Synset('abstraction.n.06'),
 Synset('entity.n.01'),
 Synset('measure.n.02'),
 Synset('medium_of_exchange.n.01'),
 Synset('money.n.01'),
 Synset('standard.n.01'),
 Synset('system_of_measurement.n.01')]

In [224]:
from nltk.corpus import wordnet as wn

relative = wn.synsets('fund', 'n')[0]
hypos = lambda s:s.hyponyms()

print(list(relative.closure(hypos)))

[Synset('budget.n.01'), Synset('deposit.n.04'), Synset('mutual_fund.n.01'), Synset('pension_fund.n.01'), Synset('petty_cash.n.01'), Synset('revolving_fund.n.01'), Synset('savings.n.01'), Synset('sinking_fund.n.01'), Synset('slush_fund.n.01'), Synset('trust_fund.n.01'), Synset('war_chest.n.01'), Synset('civil_list.n.01'), Synset('operating_budget.n.01'), Synset('demand_deposit.n.01'), Synset('exchange_traded_fund.n.01'), Synset('index_fund.n.01')]


In [225]:
from nltk.corpus import wordnet as wn

relative = wn.synsets('fund', 'n')[0]
hypers = lambda s:s.hypernyms()

print(list(relative.closure(hypers)))

[Synset('money.n.01'), Synset('medium_of_exchange.n.01'), Synset('standard.n.01'), Synset('system_of_measurement.n.01'), Synset('measure.n.02'), Synset('abstraction.n.06'), Synset('entity.n.01')]
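
The hypernym chain printed above also accounts for the 0.9333 Wu-Palmer score reported earlier for `fund.n.01` and `money.n.01`. Wu-Palmer similarity is twice the depth of the deepest common ancestor divided by the sum of the two synsets' depths. The sketch below hard-codes that chain and reproduces the calculation for this single-path case only; NLTK's own implementation also handles synsets with several hypernym paths, and the function names here are hypothetical.

```python
# Hypernym chain of fund.n.01 as printed above, reordered from root to leaf.
chain = [
    "entity.n.01",
    "abstraction.n.06",
    "measure.n.02",
    "system_of_measurement.n.01",
    "standard.n.01",
    "medium_of_exchange.n.01",
    "money.n.01",
    "fund.n.01",
]

def depth(name):
    """Depth of a synset on the chain, counting the root as depth 1."""
    return chain.index(name) + 1

def wup_on_chain(a, b):
    """Wu-Palmer similarity for two synsets that lie on the same chain."""
    # On a single chain the shallower synset is the deepest common ancestor.
    lcs = min(a, b, key=depth)
    return 2 * depth(lcs) / (depth(a) + depth(b))

# money.n.01 has depth 7 and fund.n.01 depth 8, giving 2 * 7 / (7 + 8).
print(wup_on_chain("fund.n.01", "money.n.01"))
```

The score is high because `fund.n.01` is a direct hyponym of `money.n.01`, so the two differ by a single step at the bottom of a deep chain.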


In [226]:
money = wn.synset('fund.n.01')  #look up the first noun sense of fund (the variable is still named money)
print(money)

Synset('fund.n.01')


from nltk.corpus import wordnet as wn
import networkx as nx

def closure_graph(synset, fn):
    """Build a directed graph by following fn (e.g. hypernyms) from synset."""
    seen = set()
    graph = nx.DiGraph()

    def recurse(s):
        if s not in seen:
            seen.add(s)
            #name is a method in NLTK 3, so call it to get the string
            graph.add_node(s.name())
            for s1 in fn(s):
                graph.add_node(s1.name())
                graph.add_edge(s.name(), s1.name())
                recurse(s1)

    recurse(synset)
    return graph

In [122]:
graph = closure_graph(money,
                      lambda s: s.hypernyms())

In [123]:
print(graph)

DiGraph with 7 nodes and 6 edges


In [138]:
sorted(wn.synset('money.n.01').lemmas('spa'))

[Lemma('money.n.01.dinero')]

In [139]:
filt_df3 = data[data['Lem_text'].str.contains('dinero', regex=False)]

In [140]:
filt_df3.head(0)  #head(0) returns only the column header, so use filt_df3.shape to check whether any rows matched

Unnamed: 0,Lem_title,Date,Lem_text
