# Dictionaries and Dataframes

Now that we've learned some Python basics, we'll move on to applying our tools to more sophisticated data analyses. To do so, we often don't want plain lists; we want certain elements to be associated with values. A person will have an income, for example, or a novel will use the word `humanistic` a certain number of times. Today we'll think through data types that can help us work with these associations.

## Learning Goals
* Learn about dictionaries, tuples, and dataframes
* Get comfortable with manipulating them, and slicing them
* Think through how we might use them for data analysis (more on this to come!)

# Part 1: Dictionaries

Like lists, dictionaries are mutable: they can be changed, shrunk, and grown at will at run time, without the need to make copies. Dictionaries can be contained in lists and vice versa.

But what's the difference between lists and dictionaries? Lists are ordered sequences of objects accessed by position, whereas dictionaries are collections of key-value pairs accessed by key. (Since Python 3.7, a dictionary does remember the order in which its keys were inserted, but you still look items up by key, not by position.) A dictionary is an associative array (also known as a hash map or hash table). Each key of the dictionary is associated with (or mapped to) a value. The values of a dictionary can be any Python data type; the keys must be hashable, such as strings, numbers, or tuples.


In [1]:
#a list of (key, value) tuples
my_tuple = [('education', 'high school'), ('income', 100)]
#dict() turns each (key, value) tuple into a dictionary entry
my_dict = dict(my_tuple)
type(my_dict)

dict

In [2]:
my_dict

{'education': 'high school', 'income': 100}

The key is before the colon, the value is after the colon. 

Find all the keys from the dictionary, and then all the values.

In [3]:
my_dict.keys()

dict_keys(['education', 'income'])

In [4]:
my_dict.values()

dict_values(['high school', 100])

We can access values using the bracket syntax. We've seen this before. The input is a dictionary key, the output is the key's value.

In [5]:
my_dict['education']

'high school'

In [6]:
my_dict['income']

100

We can add key/value pairs using the bracket syntax and the assignment operator. Notice that the position of a key/value pair does not matter in a dictionary, as it does in lists and strings: we always look items up by key.

In [7]:
my_dict['age'] = 24
my_dict

{'age': 24, 'education': 'high school', 'income': 100}
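
A few other everyday dictionary operations can be sketched as follows. The `person` dictionary and the `'city'` key are made up for illustration and are not part of the lesson's data. Note that the order in which a notebook displays keys can differ between Python and IPython versions.

```python
# A small, made-up record for illustration
person = {'education': 'high school', 'income': 100}
person['age'] = 24                  # add a new key/value pair

# Membership tests look at keys, not values
has_age = 'age' in person           # 'age' is a key
has_100 = 100 in person             # 100 is a value, not a key

# .get() returns a default instead of raising KeyError for a missing key
city = person.get('city', 'unknown')

# Since Python 3.7, iterating over a dictionary follows insertion order
keys_in_order = list(person)

print(has_age, has_100, city, keys_in_order)
```

Using `.get()` with a default is handy when you are not sure a key exists, which comes up often when counting things.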

## Example: Counting Words

We have been looking at different features of "words" (or, as Python knows them, elements in a string separated by white space). What if we want to find the number of times each word occurs in a text? Python's `collections` module provides a `Counter` class for this, which builds on dictionaries and another datatype, tuples. To see how that works, let's walk through an example that does the counting by hand with a plain dictionary.

One of the most frequent tasks in computational text analysis is quickly summarizing the content of text. In this lesson we will learn how to summarize text by counting frequent words in the text. In the process we'll learn to think about features, which words are important, and we'll cover some common pre-processing steps.

This technique fits under the umbrella of Natural Language Processing, a term that incorporates many techniques and methods to process, analyze, and understand natural languages (as opposed to artificial languages like logics, or Python).


### Lesson Outline:
- Tokenizing Text and Type-Token Ratio
    * Number of words
    * Type-Token Ratio
- Most frequent words


### Key Terms:

* *stop words*: 
    * The most common words in a language.
* *token*:
    *  A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
* *type*:
    * A type is the class of all tokens containing the same character sequence.

## 0. Let's begin!

First, we assign a sample sentence, our "text", to a variable called "sentence".

Note: This sentence is a quote about what digital humanities means, from digital humanist Kathleen Fitzpatrick. Source: "On Scholarly Communication and the Digital Humanities: An Interview with Kathleen Fitzpatrick", *In the Library with the Lead Pipe*

In [8]:
#assign the desired sentence to the variable called 'sentence.'
sentence = "For me it has to do with the work that gets done at the crossroads of \
digital media and traditional humanistic study. And that happens in two different ways. \
On the one hand, it's bringing the tools and techniques of digital media to bear \
on traditional humanistic questions; on the other, it's also bringing humanistic modes \
of inquiry to bear on digital media."

#print the content
print(sentence)

For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it's bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it's also bringing humanistic modes of inquiry to bear on digital media.


## 1. Tokenizing Text and Type-Token Ratios

I'll write a function to tokenize any text, called `word_tokenize`. I will lowercase the text and strip out punctuation in this step.

In [9]:
#For punctuation use the list from the string library
import string
punct_list = list(string.punctuation)

#Exercise 1 code here
def word_tokenize(text):
    text = text.lower()
    text_clean = ''.join([e for e in text if e not in punct_list])
    text_token =  text_clean.split()
    return text_token

In [10]:
#complete the line below
sentence_tokens = word_tokenize(sentence)
sentence_tokens

['for',
 'me',
 'it',
 'has',
 'to',
 'do',
 'with',
 'the',
 'work',
 'that',
 'gets',
 'done',
 'at',
 'the',
 'crossroads',
 'of',
 'digital',
 'media',
 'and',
 'traditional',
 'humanistic',
 'study',
 'and',
 'that',
 'happens',
 'in',
 'two',
 'different',
 'ways',
 'on',
 'the',
 'one',
 'hand',
 'its',
 'bringing',
 'the',
 'tools',
 'and',
 'techniques',
 'of',
 'digital',
 'media',
 'to',
 'bear',
 'on',
 'traditional',
 'humanistic',
 'questions',
 'on',
 'the',
 'other',
 'its',
 'also',
 'bringing',
 'humanistic',
 'modes',
 'of',
 'inquiry',
 'to',
 'bear',
 'on',
 'digital',
 'media']

In [11]:
#total number of words
len(sentence_tokens)

63
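
The tokenizer above removes punctuation one character at a time. As a sketch of an alternative (not the lesson's method), the same result can be had with `str.maketrans` and `str.translate`. The name `word_tokenize_translate` is hypothetical and only used here.

```python
import string

# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans('', '', string.punctuation)

def word_tokenize_translate(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    return text.lower().translate(strip_punct).split()

print(word_tokenize_translate("On the one hand, it's bringing the tools."))
```

Either way, notice a side effect of stripping punctuation: "it's" becomes "its", so the two words can no longer be told apart.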

### Type-Token Ratio

One quick calculation we can do on the text is to determine its type-token ratio.

We know what a token is. But many tokens are repeated in a text. For example, in this sentence, the token "the" appears 5 times. "The" is a type. The 5 "the"s in the sentence are tokens. The TTR is simply the number of types divided by the number of tokens. A high TTR indicates a large amount of lexical variation or lexical diversity and a low TTR indicates relatively little lexical variation. The type-token ratio of speech, for example, is less than that of written language. What might we expect from some of the novels/texts we have been analyzing?

To get a subset of our list that only contains one element of each type, we can use the `set` function:

In [12]:
set(sentence_tokens)

{'also',
 'and',
 'at',
 'bear',
 'bringing',
 'crossroads',
 'different',
 'digital',
 'do',
 'done',
 'for',
 'gets',
 'hand',
 'happens',
 'has',
 'humanistic',
 'in',
 'inquiry',
 'it',
 'its',
 'me',
 'media',
 'modes',
 'of',
 'on',
 'one',
 'other',
 'questions',
 'study',
 'techniques',
 'that',
 'the',
 'to',
 'tools',
 'traditional',
 'two',
 'ways',
 'with',
 'work'}

Question: What do you notice about this output? What datatype is it?

### Exercise 1.1: Print the type-token ratio for the sentence

In [13]:
##Exercise 1.1 code here
len(set(sentence_tokens))/len(sentence_tokens)

0.6190476190476191
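
If we want to compute the ratio for many texts, it may help to wrap the calculation in a function. This is a minimal sketch; the name `type_token_ratio` and the `sample` list are made up for illustration, and returning `0.0` for an empty text is an assumption about what we would want in that case.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the total number of tokens."""
    if not tokens:              # avoid dividing by zero on an empty text
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = ['the', 'cat', 'saw', 'the', 'dog']    # 4 types, 5 tokens
print(type_token_ratio(sample))                 # 0.8
```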

## 2. Counting Words

We are often also interested in the most frequent words, which can help us quickly summarize a text. We can do this by looping through our sentence tokens variable and creating a counts dictionary.

Let's walk through this code slowly.

In [14]:
counts = dict()                  #start with an empty dictionary
for word in sentence_tokens:
    if word not in counts:       #first time we see this word
        counts[word] = 1
    else:                        #seen before: add one to its count
        counts[word] += 1
counts

{'also': 1,
 'and': 3,
 'at': 1,
 'bear': 2,
 'bringing': 2,
 'crossroads': 1,
 'different': 1,
 'digital': 3,
 'do': 1,
 'done': 1,
 'for': 1,
 'gets': 1,
 'hand': 1,
 'happens': 1,
 'has': 1,
 'humanistic': 3,
 'in': 1,
 'inquiry': 1,
 'it': 1,
 'its': 2,
 'me': 1,
 'media': 3,
 'modes': 1,
 'of': 3,
 'on': 4,
 'one': 1,
 'other': 1,
 'questions': 1,
 'study': 1,
 'techniques': 1,
 'that': 2,
 'the': 5,
 'to': 3,
 'tools': 1,
 'traditional': 2,
 'two': 1,
 'ways': 1,
 'with': 1,
 'work': 1}
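
The loop above is essentially what the standard library's `collections.Counter` does for us. A minimal sketch, using a short made-up token list (`tokens`) rather than the full sentence:

```python
from collections import Counter

# A short, made-up token list standing in for sentence_tokens
tokens = ['the', 'work', 'of', 'the', 'digital', 'media', 'of', 'the']

tally = Counter(tokens)         # tallies each distinct token
print(tally['the'])             # 3
print(tally['missing'])         # 0: absent keys count as zero, no KeyError
print(tally.most_common(2))     # [('the', 3), ('of', 2)]
```

A `Counter` behaves like a dictionary, so everything we do with `counts` in this lesson would also work with it.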

### Exercise 1.2: Using this dictionary, print the number of times the word 'humanistic' shows up in the sentence.

In [16]:
#Exercise 1.2 code here
counts["humanistic"]

3

## 3. Most Frequent Words

We'll have to creatively combine dictionaries and tuples to find the most frequent words in our sentence.

The dictionary method `.items()` returns a view of the dictionary's key/value pairs, each of which is a tuple. This will eventually allow us to sort through the tuples.

A `tuple` is a sequence of values much like a list. The values stored in a tuple can be any type, and they are indexed by integers. The important difference is that tuples are immutable. Tuples are also comparable and hashable so we can sort lists of them and use tuples as key values in Python dictionaries.
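
The properties just described can be seen in a short sketch. The values here are made up for illustration.

```python
pair = ('humanistic', 3)        # parentheses are optional: 'humanistic', 3 is also a tuple
word, count = pair              # unpacking into two names

# Tuples are immutable: item assignment raises TypeError
try:
    pair[1] = 4
    changed = True
except TypeError:
    changed = False

# Tuples compare element by element, left to right
first_wins = (3, 'to') > (3, 'of')      # the counts tie, so 'to' > 'of' decides

# Tuples are hashable, so they can serve as dictionary keys
bigram_counts = {('digital', 'media'): 3}

print(word, count, changed, first_wins, bigram_counts[('digital', 'media')])
```

The element-by-element comparison is what lets us sort a list of `(count, word)` tuples by count.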

Syntactically, a tuple is a comma-separated list of values, usually written inside parentheses. Each item returned by `.items()` is a `(key, value)` tuple:

In [17]:
counts.items()

dict_items([('for', 1), ('me', 1), ('it', 1), ('has', 1), ('to', 3), ('do', 1), ('with', 1), ('the', 5), ('work', 1), ('that', 2), ('gets', 1), ('done', 1), ('at', 1), ('crossroads', 1), ('of', 3), ('digital', 3), ('media', 3), ('and', 3), ('traditional', 2), ('humanistic', 3), ('study', 1), ('happens', 1), ('in', 1), ('two', 1), ('different', 1), ('ways', 1), ('on', 4), ('one', 1), ('hand', 1), ('its', 2), ('bringing', 2), ('tools', 1), ('techniques', 1), ('bear', 2), ('questions', 1), ('other', 1), ('also', 1), ('modes', 1), ('inquiry', 1)])

In [18]:
#we can loop through these values like we might in a list, but notice the syntax here!
for key, value in counts.items():
    print(key, value)

for 1
me 1
it 1
has 1
to 3
do 1
with 1
the 5
work 1
that 2
gets 1
done 1
at 1
crossroads 1
of 3
digital 3
media 3
and 3
traditional 2
humanistic 3
study 1
happens 1
in 1
two 1
different 1
ways 1
on 4
one 1
hand 1
its 2
bringing 2
tools 1
techniques 1
bear 2
questions 1
other 1
also 1
modes 1
inquiry 1


In [19]:
freq_words = []
for key, val in counts.items():
    #put the count first so that the tuples sort by count
    freq_words.append((val, key))

freq_words

[(1, 'for'),
 (1, 'me'),
 (1, 'it'),
 (1, 'has'),
 (3, 'to'),
 (1, 'do'),
 (1, 'with'),
 (5, 'the'),
 (1, 'work'),
 (2, 'that'),
 (1, 'gets'),
 (1, 'done'),
 (1, 'at'),
 (1, 'crossroads'),
 (3, 'of'),
 (3, 'digital'),
 (3, 'media'),
 (3, 'and'),
 (2, 'traditional'),
 (3, 'humanistic'),
 (1, 'study'),
 (1, 'happens'),
 (1, 'in'),
 (1, 'two'),
 (1, 'different'),
 (1, 'ways'),
 (4, 'on'),
 (1, 'one'),
 (1, 'hand'),
 (2, 'its'),
 (2, 'bringing'),
 (1, 'tools'),
 (1, 'techniques'),
 (2, 'bear'),
 (1, 'questions'),
 (1, 'other'),
 (1, 'also'),
 (1, 'modes'),
 (1, 'inquiry')]

In [20]:
#sort from largest count to smallest; ties are broken by reverse alphabetical order
freq_words.sort(reverse=True)
freq_words

[(5, 'the'),
 (4, 'on'),
 (3, 'to'),
 (3, 'of'),
 (3, 'media'),
 (3, 'humanistic'),
 (3, 'digital'),
 (3, 'and'),
 (2, 'traditional'),
 (2, 'that'),
 (2, 'its'),
 (2, 'bringing'),
 (2, 'bear'),
 (1, 'work'),
 (1, 'with'),
 (1, 'ways'),
 (1, 'two'),
 (1, 'tools'),
 (1, 'techniques'),
 (1, 'study'),
 (1, 'questions'),
 (1, 'other'),
 (1, 'one'),
 (1, 'modes'),
 (1, 'me'),
 (1, 'it'),
 (1, 'inquiry'),
 (1, 'in'),
 (1, 'has'),
 (1, 'happens'),
 (1, 'hand'),
 (1, 'gets'),
 (1, 'for'),
 (1, 'done'),
 (1, 'do'),
 (1, 'different'),
 (1, 'crossroads'),
 (1, 'at'),
 (1, 'also')]

In [21]:
#each tuple is (count, word), so name the loop variables accordingly
for count, word in freq_words[:10]:
    print(count, word)

5 the
4 on
3 to
3 of
3 media
3 humanistic
3 digital
3 and
2 traditional
2 that


### Exercise 1.3: Print the 10 most *infrequent* words

In [22]:
#Exercise 1.3 code here
freq_words.sort()
for count, word in freq_words[:10]:
    print(count, word)

1 also
1 at
1 crossroads
1 different
1 do
1 done
1 for
1 gets
1 hand
1 happens


In [24]:
#alt code
freq_words.sort(reverse=True)
for count, word in freq_words[-10:]:
    print(count, word)

1 happens
1 hand
1 gets
1 for
1 done
1 do
1 different
1 crossroads
1 at
1 also


# Part 2: Pandas

<i>Pandas</i> is a popular and flexible package whose primary use is its datatype: the <i>DataFrame</i>. The dataframe is essentially a spreadsheet, like you would find in Excel, but it has some tricks up its sleeve!

As we will see, Pandas allows us to do basic statistics easily, allows us to compare columns, and allows us to do quick and easy visualizations. 

We will practice these uses of Pandas in the next three weeks. Today, I'm just planting the seed. We can easily transform lists of tuples, or lists of dictionaries, into a Pandas dataframe using the `pandas.DataFrame` constructor.

In [25]:
#get ready! import the pandas library
import pandas

In [26]:
#create a list of dictionaries, starting with the original dictionary, my_dict
df_list = []
my_dict

{'age': 24, 'education': 'high school', 'income': 100}

In [27]:
df_list.append(my_dict)
df_list

[{'age': 24, 'education': 'high school', 'income': 100}]

In [28]:
df_list.extend([{'age': 22, 'education': 'BA', 'income': 400}, {'age': 35, 'education': 'MA', 'income': 700}])
df_list

[{'age': 24, 'education': 'high school', 'income': 100},
 {'age': 22, 'education': 'BA', 'income': 400},
 {'age': 35, 'education': 'MA', 'income': 700}]

In [29]:
df = pandas.DataFrame(df_list)
df

|  | age | education | income |
|---|---|---|---|
| 0 | 24 | high school | 100 |
| 1 | 22 | BA | 400 |
| 2 | 35 | MA | 700 |


In [30]:
#we can do the same but add row names, if we'd like
row_names = ['Prof. Nelson', 'Prof. Handel', 'Prof. Blum']
df = pandas.DataFrame(df_list, index=row_names)
df

|  | age | education | income |
|---|---|---|---|
| Prof. Nelson | 24 | high school | 100 |
| Prof. Handel | 22 | BA | 400 |
| Prof. Blum | 35 | MA | 700 |
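
The same constructor also accepts a list of tuples. Because tuples carry no keys, the column names have to be supplied separately. A minimal sketch, where `rows` and `df_from_tuples` are names made up for illustration:

```python
import pandas

# The same three records as tuples instead of dictionaries
rows = [(24, 'high school', 100), (22, 'BA', 400), (35, 'MA', 700)]

# Tuples have no keys, so we pass the column names ourselves
df_from_tuples = pandas.DataFrame(rows, columns=['age', 'education', 'income'])
print(df_from_tuples)
```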


What can we do with this? A lot! Today we'll subset, slice, and do some basic arithmetic. We'll practice more with it in the coming weeks.

In [31]:
# Call up a column of the dataframe

df['income']

Prof. Nelson    100
Prof. Handel    400
Prof. Blum      700
Name: income, dtype: int64

In [32]:
# Call up a row by its integer position (counting from 0)
df.iloc[1]

age           22
education     BA
income       400
Name: Prof. Handel, dtype: object

In [33]:
#do the same using the name, but notice the syntax here
df.loc['Prof. Handel']

age           22
education     BA
income       400
Name: Prof. Handel, dtype: object

In [34]:
# Call up a couple of rows, using a list of indices

df.loc[['Prof. Nelson','Prof. Blum']]

|  | age | education | income |
|---|---|---|---|
| Prof. Nelson | 24 | high school | 100 |
| Prof. Blum | 35 | MA | 700 |


In [35]:
# Get a specific entry by calling both row and column

df.loc['Prof. Nelson']['income']

100
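
Both `.loc` and `.iloc` also accept a row and a column together, separated by a comma, which avoids chaining two sets of brackets. A sketch on a small rebuilt dataframe (`people` is a made-up name so the cell runs on its own):

```python
import pandas

# Rebuild a small version of the dataframe so this cell runs on its own
people = pandas.DataFrame(
    {'age': [24, 22, 35], 'income': [100, 400, 700]},
    index=['Prof. Nelson', 'Prof. Handel', 'Prof. Blum'],
)

by_label = people.loc['Prof. Nelson', 'income']     # row label, column label
by_position = people.iloc[0, 1]                     # row position, column position
print(by_label, by_position)
```

The single-bracket form is generally the safer habit once you start assigning values, since chained brackets can end up modifying a temporary copy.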

In [36]:
# Who has the highest income?
#Temporarily re-order the dataframe by values in the 'income' column

df.sort_values('income', ascending=False)

|  | age | education | income |
|---|---|---|---|
| Prof. Blum | 35 | MA | 700 |
| Prof. Handel | 22 | BA | 400 |
| Prof. Nelson | 24 | high school | 100 |


In [37]:
# Create a new column

df['gender'] = ['f','m','f']

In [38]:
# Inspect

df

|  | age | education | income | gender |
|---|---|---|---|---|
| Prof. Nelson | 24 | high school | 100 | f |
| Prof. Handel | 22 | BA | 400 | m |
| Prof. Blum | 35 | MA | 700 | f |


### Exercise 2.1: Call up the entry 400 from the middle of the dataframe 'df'
### Exercise 2.2:  Call up the entry BA from the middle of the dataframe 'df' 
### Exercise 2.3: Call up both entries at the same time

In [39]:
#Exercise code here
df.iloc[1]["income"]

400

In [40]:
df.iloc[1]["education"]

'BA'

In [47]:
df.iloc[1][["income","education"]]

income       400
education     BA
Name: Prof. Handel, dtype: object

## DataFrame Subsetting

In [48]:
# Slice out a column

df['income']

Prof. Nelson    100
Prof. Handel    400
Prof. Blum      700
Name: income, dtype: int64

In [49]:
# Evaluate whether each element in the column is equal to 100

df['income']==100

Prof. Nelson     True
Prof. Handel    False
Prof. Blum      False
Name: income, dtype: bool

In [50]:
# We can also use evaluation to subset the table. This time we'll use the greater than operator.
df[df['income']>200]

|  | age | education | income | gender |
|---|---|---|---|---|
| Prof. Handel | 22 | BA | 400 | m |
| Prof. Blum | 35 | MA | 700 | f |
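
Conditions can also be combined. A sketch on a small rebuilt dataframe (`people` is a made-up name): each condition goes in its own parentheses, `&` means "and", and `|` means "or".

```python
import pandas

# Rebuild a small version of the dataframe so this cell runs on its own
people = pandas.DataFrame(
    {'age': [24, 22, 35],
     'income': [100, 400, 700],
     'gender': ['f', 'm', 'f']},
    index=['Prof. Nelson', 'Prof. Handel', 'Prof. Blum'],
)

# Rows where income is above 200 AND gender is 'f'
subset = people[(people['income'] > 200) & (people['gender'] == 'f')]
print(subset)
```

The parentheses matter: without them Python applies `&` before the comparisons and the expression fails.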


### Exercise 2.4: Slice 'df' to contain only rows in which 'education' equals 'BA'

In [51]:
#Exercise 2.4 code here
df[df["education"]== "BA"]

|  | age | education | income | gender |
|---|---|---|---|---|
| Prof. Handel | 22 | BA | 400 | m |


## Arithmetic! Statistics!

In [52]:
# Our dataframe

df

|  | age | education | income | gender |
|---|---|---|---|---|
| Prof. Nelson | 24 | high school | 100 | f |
| Prof. Handel | 22 | BA | 400 | m |
| Prof. Blum | 35 | MA | 700 | f |


In [53]:
# Pandas will produce a few descriptive statistics for each column, but only for columns that are numbers

df.describe()

|  | age | income |
|---|---|---|
| count | 3.0 | 3.0 |
| mean | 27.0 | 400.0 |
| std | 7.0 | 300.0 |
| min | 22.0 | 100.0 |
| 25% | 23.0 | 250.0 |
| 50% | 24.0 | 400.0 |
| 75% | 29.5 | 550.0 |
| max | 35.0 | 700.0 |


In [54]:
# Multiply entries of the dataframe by 10

df*10

|  | age | education | income | gender |
|---|---|---|---|---|
| Prof. Nelson | 240 | high schoolhigh schoolhigh schoolhigh schoolhi... | 1000 | ffffffffff |
| Prof. Handel | 220 | BABABABABABABABABABA | 4000 | mmmmmmmmmm |
| Prof. Blum | 350 | MAMAMAMAMAMAMAMAMAMA | 7000 | ffffffffff |


In [55]:
# Add 10 to each entry

df+10

TypeError: Could not operate 10 with block values must be str, not int

We can't do it! Why not?

In [56]:
#We can do it if we specify a column

df['income']+10

Prof. Nelson    110
Prof. Handel    410
Prof. Blum      710
Name: income, dtype: int64
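
If we want to add 10 to every numeric column at once, one option is to pick out the numeric columns first with `select_dtypes`. A sketch on a small rebuilt dataframe (`people` is a made-up name):

```python
import pandas

# Rebuild a small version of the dataframe so this cell runs on its own
people = pandas.DataFrame(
    {'age': [24, 22, 35],
     'education': ['high school', 'BA', 'MA'],
     'income': [100, 400, 700]},
    index=['Prof. Nelson', 'Prof. Handel', 'Prof. Blum'],
)

# Keep only the columns that hold numbers, then add 10 to each entry
numeric = people.select_dtypes(include='number')
plus_ten = numeric + 10
print(plus_ten)
```

The text columns are simply left out of the result, so there is nothing for the addition to trip over.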

In [57]:
# Of course our dataframe hasn't changed

df

|  | age | education | income | gender |
|---|---|---|---|---|
| Prof. Nelson | 24 | high school | 100 | f |
| Prof. Handel | 22 | BA | 400 | m |
| Prof. Blum | 35 | MA | 700 | f |


In [58]:
# What if we just want to add the values in the column?

df['income'].sum()

1200

In [59]:
# We can also perform operations among columns
# Pandas knows to match up individual entries in each column

df['income/age'] = df['income']/df['age']
df

|  | age | education | income | gender | income/age |
|---|---|---|---|---|---|
| Prof. Nelson | 24 | high school | 100 | f | 4.166667 |
| Prof. Handel | 22 | BA | 400 | m | 18.181818 |
| Prof. Blum | 35 | MA | 700 | f | 20.0 |


### Exercise 2.5: `.sum()` adds the values in a column. `.mean()` calculates the mean value in a column.
### Find the mean income and the mean age for the dataframe `df`

In [61]:
#Exercise 2.5 code here
df["income"].mean()

400.0

In [62]:
df["age"].mean()

27.0
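
The two means can also be computed in one step by selecting both columns with a list of names. A sketch on a small rebuilt dataframe (`people` is a made-up name):

```python
import pandas

# Rebuild a small version of the dataframe so this cell runs on its own
people = pandas.DataFrame(
    {'age': [24, 22, 35], 'income': [100, 400, 700]},
    index=['Prof. Nelson', 'Prof. Handel', 'Prof. Blum'],
)

# Selecting with a list of column names gives a dataframe;
# .mean() then returns one value per column
means = people[['income', 'age']].mean()
print(means)
```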

When you're done, change the name of the notebook to include your last name, and upload it to Blackboard. Be prepared to practice more Pandas in the weeks to come!