In [301]:
# Import all of the things you need to import!
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
import re

pd.options.display.max_columns = 30
%matplotlib inline

# Homework 14 (or so): TF-IDF text analysis and clustering

Hooray, we kind of figured out how text analysis works! Some of it is still magic, but at least the **TF** and **IDF** parts make a little sense. Kind of. Somewhat.

No, just kidding, we're *professionals* now.

## Investigating the Congressional Record

The [Congressional Record](https://en.wikipedia.org/wiki/Congressional_Record) is more or less what happened in Congress every single day. Speeches and all that. A good large source of text data, maybe?

Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from [this page here](http://www.cs.cornell.edu/home/llee/data/convote.html).

In [302]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9607k  100 9607k    0     0   743k      0  0:00:12  0:00:12 --:--:--  443k


In [303]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz

You can explore the files if you'd like, but we're going to get the ones from `convote_v1.1/data_stage_one/development_set/`. It's a bunch of text files.

In [304]:
# glob finds files matching a certain filename pattern
import glob

# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]

['convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327025_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327044_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327046_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_1479036_DON.txt']

In [305]:
len(paths)

702

So great, we have 702 of them. Now let's import them.

In [306]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()

| | content | filename | pathname |
|---|---|---|---|
| 0 | mr. chairman , i thank the gentlewoman for yie... | 052_400011_0327014_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 1 | mr. chairman , i want to thank my good friend ... | 052_400011_0327025_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 2 | mr. chairman , i rise to make two fundamental ... | 052_400011_0327044_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 3 | mr. chairman , reclaiming my time , let me mak... | 052_400011_0327046_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 4 | mr. chairman , i thank my distinguished collea... | 052_400011_1479036_DON.txt | convote_v1.1/data_stage_one/development_set/05... |


In [307]:
speeches_df

| | content | filename | pathname |
|---|---|---|---|
| 0 | mr. chairman , i thank the gentlewoman for yie... | 052_400011_0327014_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 1 | mr. chairman , i want to thank my good friend ... | 052_400011_0327025_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 2 | mr. chairman , i rise to make two fundamental ... | 052_400011_0327044_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 3 | mr. chairman , reclaiming my time , let me mak... | 052_400011_0327046_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 4 | mr. chairman , i thank my distinguished collea... | 052_400011_1479036_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 5 | i yield to the gentleman from illinois . \n | 052_400011_1479038_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 6 | mr. chairman , reclaiming my time , the fact i... | 052_400011_1479040_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 7 | i yield to the gentleman from illinois . \n | 052_400011_1479042_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 8 | mr. chairman , reclaiming my time , i would be... | 052_400011_1479044_DON.txt | convote_v1.1/data_stage_one/development_set/05... |
| 9 | mr. chairman , i do not have it on the top of ... | 052_400011_1479046_DON.txt | convote_v1.1/data_stage_one/development_set/05... |


In class we had the `texts` variable. For the homework you can just use `speeches_df['content']` to get the same sort of list of texts.

**Take a look at the contents of the first 5 speeches**

In [308]:
contents = speeches_df['content']
contents[:5]

0    mr. chairman , i thank the gentlewoman for yie...
1    mr. chairman , i want to thank my good friend ...
2    mr. chairman , i rise to make two fundamental ...
3    mr. chairman , reclaiming my time , let me mak...
4    mr. chairman , i thank my distinguished collea...
Name: content, dtype: object

# Doing our analysis

Use the `sklearn` package and a plain boring `CountVectorizer` to get a list of all of the tokens used in the speeches. If it won't list them all, that's ok! Make a dataframe with those terms as columns.

**Be sure to include English-language stopwords**
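Before running this on all 702 speeches, it can help to see the pattern on something tiny. The sketch below uses two made-up sentences (not from the corpus) and skips stemming; the names `toy_texts`, `toy_vectorizer`, `toy_terms` and `toy_counts` are just for illustration. It reads the column order from the fitted `vocabulary_` attribute, which should behave the same across scikit-learn versions.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the speeches (hypothetical data, not from the corpus)
toy_texts = [
    "the chairman yields time",
    "trade with china and more trade",
]

# stop_words='english' drops common words such as "the" and "and"
toy_vectorizer = CountVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each term to its column index; sorting by that index
# recovers the column order of the matrix
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

# One row per text, one column per term, each cell is a raw count
toy_counts = pd.DataFrame(toy_matrix.toarray(), columns=toy_terms)
print(toy_counts)
```

Each row is a text and each column is a token, so `toy_counts['trade']` gives the count of "trade" in every text.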

In [314]:
porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

count_vectorizer = CountVectorizer(stop_words='english', tokenizer=stemming_tokenizer)
Xc = count_vectorizer.fit_transform(contents)
print(count_vectorizer.get_feature_names())


[u'-', u'--', u'-central', u'-china', u'-women', u'0', u'000', u'018', u'050', u'092', u'1', u'1-minut', u'1-year', u'10', u'10-year', u'100', u'106', u'106-286', u'107-16', u'108-27', u'108th', u'109th', u'10th', u'11', u'11-octob', u'11-style', u'110', u'114', u'117', u'118', u'11th', u'12', u'12-year-old', u'120', u'121', u'122', u'123', u'125', u'128', u'12898', u'13', u'13279', u'1332', u'1335', u'1344', u'135', u'138', u'14', u'140', u'143', u'144', u'145', u'149', u'1498', u'14th', u'15', u'15-minut', u'150', u'153', u'155', u'159', u'16', u'160', u'162', u'163', u'165', u'1671', u'1675', u'17', u'170', u'174', u'178', u'1787', u'17th', u'18', u'180', u'1800', u'181', u'1812', u'1855', u'186', u'1868', u'18th', u'19', u'190', u'1907', u'1922', u'1927', u'1930', u'1940', u'1950', u'196', u'1960', u'1964', u'1965', u'1967', u'1970', u'1971', u'1972', u'1973', u'1974', u'1976', u'1979', u'198', u'1980', u'1981', u'1982', u'1983', u'1984', u'1985', u'1986', u'1987', u'1988', u'1989'

In [315]:
original=pd.DataFrame(Xc.toarray(), columns=count_vectorizer.get_feature_names())
for i in original.columns:
    print i

-
--
-central
-china
-women
0
000
018
050
092
1
1-minut
1-year
10
10-year
100
106
106-286
107-16
108-27
108th
109th
10th
11
11-octob
11-style
110
114
117
118
11th
12
12-year-old
120
121
122
123
125
128
12898
13
13279
1332
1335
1344
135
138
14
140
143
144
145
149
1498
14th
15
15-minut
150
153
155
159
16
160
162
163
165
1671
1675
17
170
174
178
1787
17th
18
180
1800
181
1812
1855
186
1868
18th
19
190
1907
1922
1927
1930
1940
1950
196
1960
1964
1965
1967
1970
1971
1972
1973
1974
1976
1979
198
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
19th
1st
2
2-
2-day
2-year
20
20-80
20-point
200
200-year
2000
2001
2002
2003
2004
2004-p-00007
2005
2006
2007
2008
2011
2016
202
202-234-8494
202-639-6370
202-675-2324
2072-74
20th
21
21-day
2123
2132
214
216-1520
21st
22
220
2210
2217
222
223
225
226
229
23
23-24
231
234
2361
23rd
24
24-7
240
241
2411
242
2451
248
25
250
2586
26
261
263
2646
26th
27
270
273-3000
275
278
279
28
283
2844
287
2882
2884


Okay, it's **far** too big to even look at. Let's try to get a list of features from a new `CountVectorizer` that only takes the top 100 words.
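One possible way to do this directly is the `max_features` argument of `CountVectorizer`, which keeps only the most frequent terms across the whole corpus. This is a minimal sketch on made-up texts with no stemming; `top_terms_vectorizer` and the `demo_*` names are hypothetical helpers, not part of the homework.

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_terms_vectorizer(texts, n_terms=100):
    """Fit a CountVectorizer that keeps only the n_terms most frequent terms."""
    vectorizer = CountVectorizer(stop_words='english', max_features=n_terms)
    matrix = vectorizer.fit_transform(texts)
    # Column order of the matrix, read from the fitted vocabulary
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return matrix, terms

# Made-up mini corpus: "trade" appears 4 times, "china" 3, everything else once
demo_matrix, demo_terms = top_terms_vectorizer(
    ["trade trade trade china china", "china trade vote", "amendment"],
    n_terms=2,
)
print(demo_terms)
```

For the speeches, the call would presumably be `top_terms_vectorizer(speeches_df['content'], n_terms=100)`.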

In [316]:
def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

# use_idf=False switches off the IDF weighting, leaving normalized term frequencies.
# norm is left at its default ('l2'); pass norm='l1' for proportions that sum to 1 per speech.
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False)
Xc = tfidf_vectorizer.fit_transform(contents)
frequency = pd.DataFrame(Xc.toarray(), columns=tfidf_vectorizer.get_feature_names())

In [300]:
for i in frequency.columns:
    print i



In [317]:
frequncy_sum=frequency.sum()


In [318]:
# Series.sort() is deprecated; sort_values() returns a new sorted Series
frequncy_sum = frequncy_sum.sort_values(ascending=True)


In [319]:
top_100=frequncy_sum[-100:].index
top_100


Index([u'tri', u'discrimin', u'protect', u'come', u'believ', u'ms', u'day',
       u'urg', u'help', u'let', u'distinguish', u'hi', u'h', u'mani',
       u'million', u'3', u'organ', u'point', u'nation', u'opposit', u'madam',
       u'appropri', u'onli', u'servic', u'good', u'religi', u'continu',
       u'order', u'like', u'law', u'debat', u'demand', u'educ', u'record',
       u'california', u'import', u'new', u'use', u'issu', u'just', u'doe',
       u'today', u'way', u'act', u'feder', u'elect', u'know', u'consum',
       u'ani', u'american', u'fund', u'say', u'ask', u'offer', u'colleagu',
       u'congress', u'china', u'think', u'allow', u'gentlewoman', u'thank',
       u'provid', u'work', u'want', u'reserv', u'peopl', u'1', u'trade',
       u'year', u'becaus', u'need', u'children', u'right', u'make', u'legisl',
       u'support', u'veri', u'hous', u'2', u'rule', u'state', u'program',
       u'member', u'wa', u'committe', u'ha', u'head', u'vote', u'start', u's',
       u'minut', u'balan

In [320]:
p = list(top_100)
p


[u'tri',
 u'discrimin',
 u'protect',
 u'come',
 u'believ',
 u'ms',
 u'day',
 u'urg',
 u'help',
 u'let',
 u'distinguish',
 u'hi',
 u'h',
 u'mani',
 u'million',
 u'3',
 u'organ',
 u'point',
 u'nation',
 u'opposit',
 u'madam',
 u'appropri',
 u'onli',
 u'servic',
 u'good',
 u'religi',
 u'continu',
 u'order',
 u'like',
 u'law',
 u'debat',
 u'demand',
 u'educ',
 u'record',
 u'california',
 u'import',
 u'new',
 u'use',
 u'issu',
 u'just',
 u'doe',
 u'today',
 u'way',
 u'act',
 u'feder',
 u'elect',
 u'know',
 u'consum',
 u'ani',
 u'american',
 u'fund',
 u'say',
 u'ask',
 u'offer',
 u'colleagu',
 u'congress',
 u'china',
 u'think',
 u'allow',
 u'gentlewoman',
 u'thank',
 u'provid',
 u'work',
 u'want',
 u'reserv',
 u'peopl',
 u'1',
 u'trade',
 u'year',
 u'becaus',
 u'need',
 u'children',
 u'right',
 u'make',
 u'legisl',
 u'support',
 u'veri',
 u'hous',
 u'2',
 u'rule',
 u'state',
 u'program',
 u'member',
 u'wa',
 u'committe',
 u'ha',
 u'head',
 u'vote',
 u'start',
 u's',
 u'minut',
 u'balanc',
 u

In [322]:
top100frame=original[[u'florida', u'appropri', u'good', u'urg', u'like', u'remain', u'elect',
       u'consent', u'import', u'debat', u'hi', u'rise', u'question',
       u'michigan', u'today', u'feder', u'act', u'unanim', u'use', u'ohio',
       u'issu', u'oppos', u'american', u'fund', u'just', u'doe', u'educ',
       u'congress', u'china', u'way', u'subcommitte', u'point', u'continu',
       u'pleas', u'new', u'say', u'illinoi', u'claim', u'present', u'allow',
       u'provid', u'order', u'peopl', u'distinguish', u'ms', u'3', u'ye',
       u'colleagu', u'ani', u'year', u'children', u'think', u'want', u'need',
       u'work', u'becaus', u'wisconsin', u'trade', u'madam', u'legisl',
       u'make', u'support', u'consum', u'opposit', u'state', u'right', u'hous',
       u'thank', u'program', u'know', u'california', u'ask', u'texa', u'veri',
       u'offer', u'rule', u'head', u'member', u'ha', u'1', u'start', u'record',
       u'wa', u'demand', u'gentlewoman', u's', u'committe', u'2', u'reserv',
       u'vote', u'minut', u'amend', u'balanc', u'thi', u'speaker', u'time',
       u'gentleman', u'chairman', u'yield', u'mr']]

Everyone seems to start their speeches with "mr chairman". How many speeches are there in total, how many don't mention "chairman", and how many mention neither "mr" nor "chairman"?

In [336]:
chairman=top100frame[[u'chairman']]

In [342]:
chairman[chairman==0].count()
#250 speeches don't mention "chairman"

chairman    250
dtype: int64

In [347]:
# .copy() gives an independent frame, so adding a column doesn't trigger SettingWithCopyWarning
mrchairman = top100frame[['chairman', 'mr']].copy()
mrchairman['sum'] = mrchairman['chairman'] + mrchairman['mr']
mrchairman


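| | chairman | mr | sum |
|---|---|---|---|
| 0 | 3 | 2 | 5 |
| 1 | 2 | 4 | 6 |
| 2 | 2 | 3 | 5 |
| 3 | 2 | 3 | 5 |
| 4 | 1 | 2 | 3 |
| 5 | 0 | 0 | 0 |
| 6 | 1 | 1 | 2 |
| 7 | 0 | 0 | 0 |
| 8 | 1 | 1 | 2 |
| 9 | 2 | 1 | 3 |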


In [350]:
mrchairman[mrchairman['sum']==0].count()
# 75 speeches mention neither "chairman" nor "mr"

chairman    75
mr          75
sum         75
dtype: int64
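Summing the two columns and testing for zero works because counts are never negative. A slightly more direct option is to combine boolean masks. This is only a sketch on made-up counts for five speeches; the `toy` frame and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical counts for five speeches (made-up numbers)
toy = pd.DataFrame({'chairman': [3, 0, 1, 0, 0], 'mr': [2, 1, 1, 0, 0]})

# Total number of speeches
total = len(toy)

# True where "chairman" never appears; summing booleans counts the Trues
no_chairman = (toy['chairman'] == 0).sum()

# True only where both counts are zero
neither = ((toy['chairman'] == 0) & (toy['mr'] == 0)).sum()

print(total)
print(no_chairman)
print(neither)
```

On the real data the same expressions would be applied to `top100frame` instead of `toy`.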

What is the index of the speech that is the most thankful, a.k.a. includes the word 'thank' the most times?

In [361]:
top100frame[['thank']][top100frame[['thank']]['thank']>1]

| | thank |
|---|---|
| 33 | 3 |
| 57 | 4 |
| 117 | 2 |
| 124 | 3 |
| 158 | 4 |
| 165 | 3 |
| 169 | 2 |
| 187 | 2 |
| 200 | 2 |
| 201 | 3 |
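The filter above narrows things down but doesn't actually pick a winner. One way to get the index directly is `idxmax()`, keeping in mind that it only returns the first row when several speeches tie (in the rows shown, speeches 57 and 158 both have 4). A minimal sketch on a made-up Series, where `toy_thanks` is a hypothetical stand-in for `top100frame['thank']`:

```python
import pandas as pd

# Made-up 'thank' counts for four speeches
toy_thanks = pd.Series([1, 4, 0, 4], name='thank')

# idxmax() gives the index label of the first maximum
most_thankful = toy_thanks.idxmax()

# To see every speech that shares the top count, filter on the max value
tied = list(toy_thanks[toy_thanks == toy_thanks.max()].index)

print(most_thankful)
print(tied)
```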


If I'm searching for `China` and `trade`, what are the top 3 speeches to read according to the `CountVectorizer`?

In [376]:
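# .copy() avoids SettingWithCopyWarning when the 'sum' column is added
chinatrade = top100frame[['china', 'trade']][top100frame['china'] != 0].copy()
chinatrade['sum'] = chinatrade['china'] + chinatrade['trade']
# DataFrame.sort() is deprecated; sort_values() is the replacement
chinatrade.sort_values('sum', ascending=False)
# so speeches no. 379, 399 and 345 are the top ones to read
# (345 ties with 367 and 317 at a sum of 27)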


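| | china | trade | sum |
|---|---|---|---|
| 379 | 28 | 64 | 92 |
| 399 | 27 | 10 | 37 |
| 345 | 16 | 11 | 27 |
| 367 | 5 | 22 | 27 |
| 317 | 13 | 14 | 27 |
| 351 | 12 | 11 | 23 |
| 324 | 8 | 15 | 23 |
| 402 | 10 | 12 | 22 |
| 370 | 7 | 15 | 22 |
| 400 | 16 | 6 | 22 |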
400,16,6,22


Now what if I'm using a `TfidfVectorizer`?
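A rough sketch of how that might look: fit a `TfidfVectorizer`, add up the TF-IDF weights of the search terms for each text, and sort. The helper `rank_by_terms` and the three toy texts are hypothetical. The sketch skips stemming, so the query terms are plain lowercase words rather than stems.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_by_terms(texts, query_terms, top_n=3):
    """Return the top_n texts ranked by the summed TF-IDF weight of query_terms."""
    vectorizer = TfidfVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform(texts)
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    scores = pd.DataFrame(matrix.toarray(), columns=terms)
    # Ignore query terms that never occur in the corpus
    present = [term for term in query_terms if term in scores.columns]
    return scores[present].sum(axis=1).sort_values(ascending=False).head(top_n)

# Made-up texts: the first is only about china and trade, the second
# mentions trade once among other topics, the third mentions neither
toy_speeches = [
    "china trade china trade",
    "trade deficit talk about schools roads hospitals",
    "mr chairman i yield",
]
ranking = rank_by_terms(toy_speeches, ['china', 'trade'])
print(ranking)
```

Because TF-IDF normalizes by document length, a short speech that is mostly about the search terms can outrank a long one that merely mentions them often, which is the main difference from the raw counts above.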

**What's the content of the speeches?** Here's a way to get them:

In [266]:
# index 0 is the first speech, which was the first one imported.
paths[0]

'convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt'

In [267]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
!cat {paths[0]}

mr. chairman , i thank the gentlewoman for yielding me this time . 
my good colleague from california raised the exact and critical point . 
the question is , what happens during those 45 days ? 
we will need to support elections . 
there is not a single member of this house who has not supported some form of general election , a special election , to replace the members at some point . 
but during that 45 days , what happens ? 
the chair of the constitution subcommittee says this is what happens : martial law . 
we do not know who would fill the vacancy of the presidency , but we do know that the succession act most likely suggests it would be an unelected person . 
the sponsors of the bill before us today insist , and i think rightfully so , on the importance of elections . 
but to then say that during a 45-day period we would have none of the checks and balances so fundamental to our constitution , none of the separation of powers , and that the presidency would be filled b

**Now search for something else!** Another two terms that might show up. `elections` and `chaos`? Whatever you think might be interesting.

# Enough of this garbage, let's cluster

Using a **simple counting vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency inverse document frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

**Which one do you think works the best?**
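One possible shape for all three runs is a single helper that takes any vectorizer, clusters with `KMeans`, and reads the top terms off each cluster centroid. This is a sketch under assumptions, not the expected answer: `top_terms_per_cluster`, the `vectorizers` dictionary, and the toy documents are made up, stemming is left out, and `norm='l1'` for the term frequency version is one reasonable choice among several.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def top_terms_per_cluster(vectorizer, texts, n_clusters=8, n_terms=5):
    """Cluster texts with KMeans and list the heaviest terms of each centroid."""
    matrix = vectorizer.fit_transform(texts)
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(matrix)
    # Sort each centroid's term weights from largest to smallest
    order = km.cluster_centers_.argsort()[:, ::-1]
    return [[terms[i] for i in order[c, :n_terms]] for c in range(n_clusters)]

# The three flavours asked for: raw counts, term frequency, and TF-IDF
vectorizers = {
    'count': CountVectorizer(stop_words='english'),
    'tf': TfidfVectorizer(stop_words='english', use_idf=False, norm='l1'),
    'tfidf': TfidfVectorizer(stop_words='english'),
}

# Tiny made-up corpus with two obvious topics, clustered into two groups
toy_docs = [
    "trade china trade china",
    "china trade tariff",
    "school teacher school children",
    "children school education",
]
toy_clusters = top_terms_per_cluster(vectorizers['count'], toy_docs,
                                     n_clusters=2, n_terms=3)
print(toy_clusters)
```

For the homework, each vectorizer in `vectorizers` would be passed in turn along with `speeches_df['content']` and the default `n_clusters=8`.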

# Harry Potter time

I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.

I want you to read them in, vectorize them and cluster them. Use this process to find out **the two types of Harry Potter fanfiction**. What is your hypothesis?
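A possible starting point for the reading-in step, assuming the archive holds one story per `.txt` file (the layout of `hp.zip` isn't described here, so that is a guess). The helper `read_zip_texts` is hypothetical, and the sketch is exercised on a tiny in-memory archive with made-up file names instead of the real download.

```python
import io
import zipfile
import pandas as pd

def read_zip_texts(zip_source, suffix='.txt'):
    """Read every archive member ending in `suffix` into a DataFrame."""
    rows = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                # errors='replace' keeps going if a story has odd bytes
                content = archive.read(name).decode('utf-8', errors='replace')
                rows.append({'filename': name, 'content': content})
    return pd.DataFrame(rows, columns=['filename', 'content'])

# Tiny in-memory archive standing in for hp.zip (hypothetical file names)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('hp/story_a.txt', 'harry and hermione study in the library')
    archive.writestr('hp/story_b.txt', 'draco duels harry at midnight')
    archive.writestr('hp/notes.md', 'not a story')
buffer.seek(0)

demo_df = read_zip_texts(buffer)
print(demo_df)
```

With the real file downloaded, `read_zip_texts('hp.zip')` would take the place of the in-memory buffer, and the `content` column can then be vectorized and clustered into two groups the same way as the speeches.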