# Hist 3368 - Generating a KWIC for a member of Congress and a keyword

#### By Jo Guldi, with code borrowed from the Programming Historian

#### Import software packages and define helper functions

Most Python scripts will begin with a series of commands to "install" and "import" software packages.

Software packages are collections of software instructions, held in the cloud, which users like us borrow as a shortcut so that we don't have to write all the code from scratch.

For this exercise, we'll be loading two of the most common software packages -- 'pandas,' which is used to read tabular data (where the data is stored with columns and rows), and 'csv,' Python's built-in package for reading and writing CSV files.

In [2]:
import pandas as pd
import csv

### Read in the data

First, we want to tell the computer to navigate to the course folder, where the data for this course lives.

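The command "cd" tells the notebook to "change directory."  "cd" is followed by a space and the name of a folder.  Note that "cd" is a Jupyter shortcut rather than ordinary Python, so it only works inside a notebook cell.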

In [28]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi
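Outside a notebook, "cd" is not available, but Python's standard `os` package can do the same job. The sketch below is only an illustration: it uses the computer's temporary folder as a stand-in for the course folder so that it runs on any machine, and the variable names `start`, `target`, and `moved_to` are made up for this example.

```python
import os
import tempfile

start = os.getcwd()              # remember the folder we started in
target = tempfile.gettempdir()   # stand-in for the course folder (assumption for this example)

os.chdir(target)                 # plain-Python equivalent of "cd"
moved_to = os.getcwd()           # ask Python which folder it is looking at now
print(moved_to)

os.chdir(start)                  # return to where we started
```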


Next, we tell Python to load the data from a file in the directory.  The name of the file is "congress-just-1970-2010.csv".  "CSV" stands for "comma separated values" -- the most common type of file for data analysis; it usually holds data organized in columns and rows.  The command for loading the data is pd.read_csv().

In [23]:
congress = pd.read_csv("congress-just-1970-2010.csv")
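To see what pd.read_csv() does without needing the course file, here is a minimal sketch that reads a tiny CSV held in a string. The two rows, the speaker names, and the variable names `toy_csv` and `toy` are invented for illustration; only the column names are borrowed from the real dataset.

```python
import io
import pandas as pd

# A tiny, made-up CSV held in a string (it stands in for a file on disk).
toy_csv = io.StringIO(
    "speech,speaker,year\n"
    "The clerk will call the roll.,Mr. EXAMPLE,1970\n"
    "I yield the floor.,Ms. SAMPLE,1971\n"
)

toy = pd.read_csv(toy_csv)   # the same command we use for the real file
print(toy.shape)             # (number of rows, number of columns)
print(list(toy.columns))     # column names, taken from the first line of the CSV
```

The first line of a CSV becomes the column names, and every later line becomes one row.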

Let's look at the data.  Run the next cell to inspect our data.

In [45]:
congress

Unnamed: 0,speech,chamber,date,speaker,first_name,last_name,state,year,5yrperiod,index
5369522,The second session of the 91st Congress will n...,S,1970-01-19,The PRESIDENT pro tempore,Unknown,Unknown,Unknown,1970,1970.0,5369522
5369523,Mr. President. I suggest the absence of a quorum.,S,1970-01-19,Mr. MANSFIELD,Unknown,MANSFIELD,Unknown,1970,1970.0,5369523
5369524,The clerk will call the roll.,S,1970-01-19,The PRESIDENT pro tempore,Unknown,Unknown,Unknown,1970,1970.0,5369524
5369525,I announce that the Senator from Connecticut ....,S,1970-01-19,Mr. KENNEDY,Unknown,KENNEDY,Unknown,1970,1970.0,5369525
5369526,I announce that the Senator from Colorado . th...,S,1970-01-19,Mr. GRIFFIN,Unknown,GRIFFIN,Unknown,1970,1970.0,5369526
...,...,...,...,...,...,...,...,...,...,...
10876956,Madam Speaker. on rollcall Nos. 662 and 661. I...,E,2010-12-22,Ms. GRANGER,Unknown,GRANGER,Unknown,2010,2010.0,10876956
10876957,Madam Speaker. as I leave Congress as the peop...,E,2010-12-22,Ms. KILPATRICK of Michigan,Unknown,KILPATRICK,Michigan,2010,2010.0,10876957
10876958,Madam Speaker. on rolicall No. 658. I was unav...,E,2010-12-22,Mr. HELLER,Unknown,HELLER,Unknown,2010,2010.0,10876958
10876959,Madam Speaker. on rollcall No. 658 my flight w...,E,2010-12-22,Mr. PAULSEN,Unknown,PAULSEN,Unknown,2010,2010.0,10876959


### Look for one speaker

Next, we're going to 'filter' the congress dataframe twice: first to keep only the years from 1970 onward, then to keep just one speaker.

In [43]:
congress = congress[congress['year'] >= 1970] # keep only speeches from 1970 onward

In [5]:
one_speaker = congress[congress['speaker'] == "Mr. STEVENS"]

You can change the word in the quotation marks to search for another speaker, but whatever data you input should match *exactly* the format of the data in the database.

For our purposes, the speaker should be one of the following: 'Mr. STEVENS', 'Mr. ROHRABACHER', 'Mr. DUNCAN', 'Ms. FOXX', 'Mr. HATCH', 'Mr. HERGER'.  

Try swapping "Mr. STEVENS" for another one of these names, in double or single quotation marks.  Then run the line of code again using SHIFT+ENTER/SHIFT+RETURN.
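The filter above works in two steps, which the following sketch spells out on a made-up dataframe. The four rows and the names `toy`, `mask`, and `one_speaker_toy` are invented for this example; the speaker labels follow the format used in the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "speaker": ["Mr. STEVENS", "Ms. FOXX", "Mr. STEVENS", "Mr. HATCH"],
    "speech": ["First remark.", "Second remark.", "Third remark.", "Fourth remark."],
})

# List each distinct speaker and how many rows they have, so we can copy the exact spelling.
print(toy["speaker"].value_counts())

# Step 1: the comparison produces one True/False value per row.
mask = toy["speaker"] == "Mr. STEVENS"

# Step 2: putting that True/False list inside the brackets keeps only the True rows.
one_speaker_toy = toy[mask]
print(len(one_speaker_toy))

# The match is exact: a different capitalisation finds nothing.
print(len(toy[toy["speaker"] == "Mr. Stevens"]))
```

Running `value_counts()` on the real `congress['speaker']` column is one way to check how a name is written before you filter for it.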

In [14]:
one_speaker.head()

Unnamed: 0,speech,chamber,date,speaker,first_name,last_name,state,year,5yrperiod,index
4691690,Mr. President. I introduce for appropriate ref...,S,1970-01-23,Mr. STEVENS,Unknown,STEVENS,Unknown,1970,1970.0,4691690
4692540,Mr. President. will the Senator yield?,S,1970-01-26,Mr. STEVENS,Unknown,STEVENS,Unknown,1970,1970.0,4692540
4692542,As a former U.S. attorney. I want to commend t...,S,1970-01-26,Mr. STEVENS,Unknown,STEVENS,Unknown,1970,1970.0,4692542
4696348,Mr. President. I send to the desk an amendment...,S,1970-02-02,Mr. STEVENS,Unknown,STEVENS,Unknown,1970,1970.0,4696348
4696350,Mr. President. Alaskas mountainous coastal geo...,S,1970-02-02,Mr. STEVENS,Unknown,STEVENS,Unknown,1970,1970.0,4696350


### Look for one word

Next, let's search the data in the dataframe *one_speaker* for a keyword, 'environmentalist.' 

The lines of code below create a "variable" called "word1" whose value is the word "environmentalist."  

The next lines search the dataset *one_speaker* for the string contained in the variable *word1.*

We create a new dataset called *contains_word1.*  The contents of this new dataset are all the speeches by our speaker where that speaker uses word1 somewhere in the speech.

In [16]:
word1 = "environmentalist"

contains_word1 = one_speaker[one_speaker['speech'].str.contains(word1)].copy() # search the text for the presence of our keyword 

contains_word1.head()

Unnamed: 0,speech,chamber,date,speaker,first_name,last_name,state,year,5yrperiod,index
5056852,Mr. President. the Senate Commerce Subcommitte...,S,1972-02-14,Mr. STEVENS,Unknown,STEVENS,Unknown,1972,1970.0,5056852
5139059,Mr. President. the first quarter 1972 issue of...,S,1972-07-31,Mr. STEVENS,Unknown,STEVENS,Unknown,1972,1970.0,5139059
5225633,Mr. President. those who suggest the oil from ...,S,1973-03-28,Mr. STEVENS,Unknown,STEVENS,Unknown,1973,1970.0,5225633
5282325,Mr. President. I had hoped to ask the Senator ...,S,1973-07-11,Mr. STEVENS,Unknown,STEVENS,Unknown,1973,1970.0,5282325
5283156,Not at this time. I would like to be specific....,S,1973-07-12,Mr. STEVENS,Unknown,STEVENS,Unknown,1973,1970.0,5283156
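Two details of `.str.contains()` are worth knowing: it matches parts of words (so "environmentalist" also finds "environmentalists"), and by default it is case-sensitive. The sketch below illustrates both on a made-up dataframe; the sentences and the names `toy`, `exact_case`, and `any_case` are invented for this example. The options `case=False` and `na=False` are standard pandas options, but whether you want them depends on your research question.

```python
import pandas as pd

toy = pd.DataFrame({"speech": [
    "The environmentalists objected to the pipeline.",
    "Environmentalist groups wrote to the committee.",
    "I suggest the absence of a quorum.",
    None,  # a missing speech
]})

word1 = "environmentalist"

# Default search: case-sensitive, matches parts of words.
# na=False treats a missing speech as "no match" instead of raising an error.
exact_case = toy[toy["speech"].str.contains(word1, na=False)]

# case=False also catches the capitalised "Environmentalist" at the start of a sentence.
any_case = toy[toy["speech"].str.contains(word1, case=False, na=False)]

print(len(exact_case), len(any_case))
```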


### Saving the data as a file

Finally, let's save our work in a format where we can easily open it in Microsoft Excel and then paste it as a table into Microsoft Word.

Microsoft Excel will make it easier to read the full speech. 

When we work with data, most of the time we save the data as a "comma separated values" file, or "CSV."  Excel has no trouble reading CSV files as a normal table.  

Use the following command to switch from the "scratch" folder, where we got the data, to your home folder.

Place your cursor on the box below and press SHIFT+ENTER/SHIFT+RETURN.  Note that the computer tells you that it is looking at a folder with your name in it.  This is your home folder.

In [18]:
cd ~/

/users/jguldi


The next line saves our dataframe, *contains_word1*, as a CSV file with  the name 'kwic.csv'.  

Notice that you can call the output file whatever you want.  Just change the name of the file inside the quotation marks.

In [19]:
contains_word1.to_csv('kwic.csv')
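By default, `.to_csv()` also writes the dataframe's row index as an extra first column. When such a file is read back in, that column typically shows up with the name "Unnamed: 0", like the one in the tables above. The sketch below shows the difference on a one-row, made-up dataframe; the file name `kwic_example.csv` and the temporary folder are assumptions chosen so the example cleans up after itself.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"speaker": ["Mr. STEVENS"], "year": [1972]})

# Hypothetical file name, saved in the computer's temporary folder.
out_path = os.path.join(tempfile.gettempdir(), "kwic_example.csv")

toy.to_csv(out_path)                 # default: the row index is written as an extra column
with_index = pd.read_csv(out_path)
print(list(with_index.columns))

toy.to_csv(out_path, index=False)    # index=False leaves the row index out of the file
without_index = pd.read_csv(out_path)
print(list(without_index.columns))

os.remove(out_path)                  # tidy up the example file
```

If you do not need the index column in Excel, adding `index=False` keeps the table a little cleaner.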

In the left-hand navigation pane, you should be able to see the new file.  

If you do not see a left-hand navigation pane, you probably loaded "JUPYTER NOTEBOOK" rather than "JUPYTER LAB" in the launch pane on hpc.smu.edu.  Just start a new JUPYTER LAB session from hpc.smu.edu and you should see the file. 

### From CSV file to Microsoft Word Table

You can use RIGHT CLICK (or CONTROL+CLICK on a Mac) with your mouse to open a menu for the new file in the left pane.  

One of your options is "download."  Select this option to download the file.

Next, open Microsoft Excel on your own computer.  Choose FILE > IMPORT.  Tell your computer that you want to import a CSV file.  Then navigate to your downloads folder and find your new file.  If you are asked for any other parameters, choose the default. 

Once you have the data in Excel, select CTRL+A to highlight everything.  Select CTRL+C to copy it.  

Then create a Microsoft Word document and select CTRL+V to paste the new data as a table. 

You can now format the data however you want.

### The Classic KWIC View

The following lines of code display a keyword in its context.  This kind of display is called a KWIC, short for "keyword in context."  

Note that we define the variable *word1* again here.  You may change the content of the variable word1 to match any word you're looking for.

We also define the variable *n*, which represents the number of words to display before and after word1.

In [2]:
word1 = "environmentalist"


def KWIC_bigram(body):
    n = 50 # specify the number of surrounding words to use before and after the keyword
    
    words = body.split() # split the speech into a list of words (tokens)
    span = len(word1.split()) # how many words the keyword has: 1 for a single word, 2 for a bigram
    
    # The following for loop slides a window of `span` words across the speech, one position at a time.
    # Here `index` is the position of a word in the list `words` (0 is the first word), not a row of the dataframe.
    # We need that position so that we can slice out the n words before the keyword and the n words after it.
    for index in range(len(words) - span + 1):
        if word1 in ' '.join(words[index:index + span]):
            before_keyword = ' '.join(words[max(0, index - n):index]) # store the words that come before the keyword, up to our specified number 
            after_keyword = ' '.join(words[index + span:index + span + n]) # store the words that come after the keyword, up to our specified number 
            return before_keyword + " \033[1;3m" + word1.upper() + "\033[0m " + after_keyword # return the keyword in its context
    return '' # the keyword was not found in this speech

contains_word1['context'] = contains_word1['speech'].apply(KWIC_bigram)

# Print each keyword-in-context snippet, followed by the date of the speech it comes from
for i in range(len(contains_word1)):
    print(contains_word1['context'].iloc[i])
    print(contains_word1['date'].iloc[i])

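If you see the message "NameError: name 'contains_word1' is not defined", the notebook has not yet created the dataframe *contains_word1*.  Run the cells under "Look for one speaker" and "Look for one word" first, then run this cell again.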
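To watch the keyword-in-context idea at work on something small enough to check by eye, here is a simplified sketch that handles single-word keywords only. The function name `kwic_example`, the square-bracket highlighting, and the example sentence are all invented for illustration; this is not the function used above.

```python
def kwic_example(text, keyword, n=3):
    """Return up to n words on either side of the first word that contains keyword."""
    words = text.split()  # split the text into a list of words
    for i, word in enumerate(words):
        if keyword in word:
            before = " ".join(words[max(0, i - n):i])   # up to n words before the match
            after = " ".join(words[i + 1:i + 1 + n])    # up to n words after the match
            return before + " [" + word.upper() + "] " + after
    return ""  # keyword not found


# A made-up sentence in the style of the dataset.
sentence = "Mr. President. the environmentalist groups in my State support this amendment."

print(kwic_example(sentence, "environmentalist"))
# prints: Mr. President. the [ENVIRONMENTALIST] groups in my

print(repr(kwic_example(sentence, "pipeline")))
# prints: ''
```

With n=3, the function keeps three words on each side of the keyword. A speech that does not contain the keyword gives back an empty string.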