# Introduction to Statistical Thinking in Python

### Learning Goals
* Gain a big-picture understanding of how we think about the mathematical relationship between two variables
* Gain an intuition for how this looks in Python
* Prepare for working in Pandas yourself
* Learn the powerful groupby function in Pandas to summarize real data

# Part 1: Summarizing Numerical Data

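How do we summarize numerical data? Two numbers go a long way: the mean, which describes the center of the data, and the standard deviation, which describes how spread out the data are around that center.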

![datapoints.jpg](attachment:datapoints.jpg)
![groupmeans.jpg](attachment:groupmeans.jpg)
![standarddeviation.jpg](attachment:standarddeviation.jpg)
![standarddeviation2.jpg](attachment:standarddeviation2.jpg)
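
As a minimal sketch of these two summaries, the snippet below computes the mean and standard deviation of a small made-up list of values (the numbers are illustrative, not from the course data). Note that NumPy's `np.std` divides by n by default, while pandas' `Series.std` divides by n - 1, so the two can give slightly different answers for the same data.

```python
import numpy as np
import pandas

# Hypothetical values chosen only for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = np.mean(values)                       # center of the data
population_std = np.std(values)              # NumPy default: divide by n (ddof=0)
sample_std = pandas.Series(values).std()     # pandas default: divide by n - 1 (ddof=1)

print(mean, population_std, sample_std)
```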

# Part 2: Visualizing the relationship between two variables
Twitter just upped their character quota from 140 to 280. My question: how many characters per sentence? Obviously we can simply count the number of characters over a large number of sentences and then average it. But what is a sentence? Can we answer this question without dividing a text into sentences?

I will do so by looking at the relationship between two variables in a sample text: the number of characters in a chapter and the number of periods (the full stop punctuation).

In [1]:
#import necessary libraries
#I'll use numpy in an example, but we won't go deeper into this library
import pandas
import numpy as np
import csv

import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
%matplotlib inline

In [2]:
#Reminder: Computers are really fast at math!
2 + 3

5

In [3]:
2 * 3

6

In [4]:
1 + 2 * (3 * 4 * 5 // 6) ** 3 + 7 + 8 - 9 + 10

2017

In [5]:
#read our data into Python using the Pandas library
t = pandas.read_csv('../data/little_women_chapters.csv')
t

Unnamed: 0,Chapter,Periods,Characters
0,1,189,21759
1,2,188,22148
2,3,231,20558
3,4,195,25526
4,5,255,23395
5,6,140,14622
6,7,131,14431
7,8,214,22476
8,9,337,33767
9,10,185,18508


In [None]:
#visualize the relationship between the number of periods in a chapter and the number of characters
#What does this graph tell you?
#name the columns explicitly so the plot does not depend on column order
t.plot.scatter('Periods', 'Characters')

In [None]:
#Draw a line that approximates the mathematical relationship between two variables
#Reminder: how do we define a line between two points?

def draw_line(a, b):
    x = np.array([50, 450])
    y = a * x + b
    
    t.plot.scatter('Periods', 'Characters')
    plots.plot(x, y)

In [None]:
draw_line(140, 0)
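
As a reminder of how two points pin down a line, here is a small sketch that computes the slope and intercept of the line through two chapters from the table above (chapters 6 and 9). The helper name `line_through` is made up for this illustration.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)      # rise over run
    intercept = y1 - slope * x1        # solve y1 = slope * x1 + intercept
    return slope, intercept

# (periods, characters) for chapters 6 and 9 in the table above
a, b = line_through((140, 14622), (337, 33767))
print(a, b)
```

A line through only two chapters ignores all the others, which is why the next cells look at how well a line fits every point.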

In [None]:
#How do we understand how well the line "fits" the data?
#Maybe by looking at how far each point is from that line.
#We can draw the lines in to help us visualize it

def draw_line(a, b):
    x = np.array([50, 450])
    y = a * x + b
    
    t.plot.scatter('Periods', 'Characters')
    plots.plot(x, y)
    
    for _, row in t.iterrows():
        plots.plot([row.Periods, row.Periods], [row.Characters, a * row.Periods + b], color='r', lw=0.5)

In [None]:
draw_line(140, 0)

In [None]:
#Looks to me like the slope is too great. 
#How do I know this?

#Try a smaller slope.

draw_line(80, 0)

In [None]:
#Slope looks better, but the entire line is too low
#How do I know this?
#Raise the line

draw_line(80, 5000)

In [None]:
#Looking is one thing, but how do I mathematically define the fit of the line to the data?
#One way is taking the average squared difference between the point and the line
def average_squared_distance(a, b):
    x = t['Periods']
    y = t['Characters']
    fitted = a * x + b
    #divide by 1e6 only to keep the result a readable size
    return np.average((y - fitted) ** 2) / 1e6

In [None]:
average_squared_distance(140, 0)

In [None]:
average_squared_distance(80, 0)

In [None]:
average_squared_distance(80, 5000)

In [None]:
#Insert hypothetical function to determine which numbers minimize the average squared distance
#The function would output the slope and intercept for the line that minimizes this distance

In [None]:
#This is what the function would tell us
draw_line(86.97784106,  4744.78483438)
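
One standard way to find such numbers is ordinary least squares, which chooses the slope and intercept that minimize the average squared distance. The sketch below uses `np.polyfit` with degree 1 on only the ten chapters displayed earlier, so its output will likely differ from the values passed to `draw_line` above. Whether those values were obtained this way is an assumption.

```python
import numpy as np

# The ten (periods, characters) rows displayed earlier; the full file may contain more
periods = np.array([189, 188, 231, 195, 255, 140, 131, 214, 337, 185])
characters = np.array([21759, 22148, 20558, 25526, 23395,
                       14622, 14431, 22476, 33767, 18508])

def mean_squared_distance(a, b):
    """Average squared vertical distance between the points and the line y = a*x + b."""
    return np.mean((characters - (a * periods + b)) ** 2)

# A degree-1 polyfit returns the least-squares [slope, intercept]
slope, intercept = np.polyfit(periods, characters, 1)

print(slope, intercept)
# The fitted line should score no worse than our hand-picked guess
print(mean_squared_distance(slope, intercept), mean_squared_distance(80, 5000))
```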

What have we learned about the relationship between the number of periods and number of characters?

Summarize what we have learned, but also tell me mathematically.

# Part 3: Working with Real Data!

We'll focus on the numerical column 'score', and think through different ways of summarizing and visualizing it.

In [None]:
df = pandas.read_csv("../data/BDHSI2016_music_reviews.csv", sep = '\t')
df

In [None]:
####Exercise 1: Summarize your data

In [None]:
#To summarize non-numeric columns use the value_counts() function
df['genre'].value_counts()

In [None]:
#Next: visualize it using a histogram
#This is good practice whenever working with data. Before you do anything, visualize it!
#What did we learn?
df.hist()

In [None]:
#There are different genres in the data. Perhaps the average score is different for each genre?
#We can compare groups using the groupby() function

grouped = df.groupby('genre')
grouped

In [None]:
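#Let's compare the mean!
#numeric_only=True skips text columns such as 'body', which recent pandas versions refuse to average
grouped.mean(numeric_only=True)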

In [None]:
#sort the values
grouped.mean(numeric_only=True).sort_values(by='score', ascending=False)
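
To see what `groupby` does step by step, here is a minimal sketch on a tiny made-up DataFrame (the genres and scores are invented, not taken from the reviews file). Grouping splits the rows by genre, and `mean()` then reduces each group to a single number.

```python
import pandas

# Invented data for illustration only
toy = pandas.DataFrame({
    'genre': ['Rock', 'Rock', 'Jazz', 'Jazz', 'Jazz'],
    'score': [6.0, 8.0, 9.0, 7.0, 8.0],
})

# Select the numeric column, then compute one mean per genre
genre_means = toy.groupby('genre')['score'].mean()

print(genre_means.sort_values(ascending=False))
```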

In [None]:
#we can do any calculation
#it will do the same calculation on every column
grouped.max()

In [None]:
#Visualize it!
#add the .plot() function to our calculation
grouped.mean(numeric_only=True).plot(kind='bar')

In [None]:
#sort the values to make it easier to see
grouped.mean(numeric_only=True).sort_values(by='score', ascending=False).plot(kind='bar')

In [None]:
#Add error bars to indicate spread (one standard deviation above and below the mean)
grouped.mean(numeric_only=True).sort_values(by='score', ascending=False).plot(kind='bar', yerr=grouped.std(numeric_only=True))

## Functions inside of Pandas!

What about that awesome text column? What can we do with that? We can do anything with it, by applying our own function to it.

Let's do something simple: add a word count column and compare across genres. Perhaps reviews are more or less verbose depending on genre.

In [None]:
##define a function that counts words
##we've seen this before

def count_words(text):
    return len(text.split())

In [None]:
#create a copy of our dataframe so we don't muck up the original
df_wc = df.copy()

##create a new column in our df and apply our function, to the 'body' column


df_wc['word_count'] = df_wc['body'].apply(count_words)

#view our dataframe
df_wc
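
`apply` calls the function once for each value in the column. The sketch below shows this on a small invented Series, along with a guarded variant for missing values: `text.split()` raises an error when a cell is empty, because pandas stores it as `None` or `NaN` rather than a string. The name `count_words_safe` is hypothetical.

```python
import pandas

# Invented review texts, including one missing value
bodies = pandas.Series(['a fine record', 'loud', None])

def count_words_safe(text):
    """Count whitespace-separated words; treat missing text as zero words."""
    if not isinstance(text, str):
        return 0
    return len(text.split())

word_counts = bodies.apply(count_words_safe)
print(word_counts.tolist())
```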

In [None]:
###Exercise 2a: Summarize the word_count column
###Exercise 2b: Visualize the word_count column
###Exercise 2c: Compare word counts by genre

In [None]:
###Exercise 3: Visualize the average word count by genre, with error bars

In [None]:
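#We can put both columns on the same graph
#grouped_wc is the word-count dataframe grouped by genre (as in Exercise 2c)
grouped_wc = df_wc.groupby('genre')
grouped_wc.mean(numeric_only=True).sort_values(by='score', ascending=False).plot(kind='bar', yerr=grouped_wc.std(numeric_only=True))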

In [None]:
#is there a relationship between score and word_count?
#use df_wc, since the word_count column only exists in the copy
df_wc.plot(kind='scatter', x='score', y='word_count')

In [None]:
###Exercise 4: Visualize the relationship between average score and average word count, by genre
###Hint: use a scatter plot on the grouped dataframe

In [None]:
#Save the dataframe with the extra column to a csv file on your hard drive.
df_wc.to_csv("../data/BDHSI2016_music_reviews_with_wordcount.csv")

When done, add your name to the notebook, save it, and upload it to Blackboard.