# Today

1. Welcome and introductions
2. What is Data Science? Math, programming, and specific knowledge
3. Setup: GitHub and Colaboratory
4. Review of Python
5. Introduction to pandas

# Course Schedule

|Session|Date|Topic|Dataset|
|---|---|---|---|
|1|Mar 6|Introduction and Pandas|IMDB movies|
|2|Mar 13|Data visualization|Spotify hits|
|3|Mar 20|Statistics: distributions and randomness|Basketball stats|
|4|Mar 27|Statistics: hypothesis testing|Polling and elections|
|5|Apr 3|Regression: single variable|World happiness|
|6|Apr 10|Regression: multivariable|World happiness|
|7|Apr 17|Machine learning|Text classification|
|8|May 1|Final project prep||
|9|May 8|Final project presentations||




# Intro to github and colaboratory

We will use this repo: https://github.com/haroldfox/ts-stuy-2019

Navigate to https://colab.research.google.com/github

Set haroldfox/ts-stuy-2019 as the repo

Choose notebooks/01_exploring_data_in_python.ipynb from the **list**

A data scientist's key tool is the Jupyter notebook, a mix of code and text. In Colaboratory, you can add code cells or text cells and reorder them.

# Review of python

In [0]:
"Hello " + "there"

In [0]:
1+1

In [0]:
['Lists', 'contain', 'multiple', 'items']

Dictionaries map keys to data

In [0]:
{'Yankees': 'New York', 'Red Sox': 'Boston', 'Blue Jays': 'Toronto', 'Dodgers': 'Los Angeles'}
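
As a small sketch of how a dictionary maps keys to data, the example below looks up values by key. The variable names (`team_cities`, `red_sox_city`, `mets_city`) and the added `'Mets'` entry are made up for illustration.

```python
# Hypothetical names; the dictionary contents mirror the cell above.
team_cities = {'Yankees': 'New York', 'Red Sox': 'Boston',
               'Blue Jays': 'Toronto', 'Dodgers': 'Los Angeles'}

# Square brackets return the value stored under a key.
red_sox_city = team_cities['Red Sox']

# .get() returns a default instead of raising KeyError for a missing key.
mets_city = team_cities.get('Mets', 'unknown')

# Assigning to a new key adds an entry to the dictionary.
team_cities['Mets'] = 'New York'

print(red_sox_city, mets_city, len(team_cities))
```

Using `.get()` with a default is handy when you are not sure a key exists.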

In [0]:
def capitalize(s):
  return ' '.join([w[0].upper() + w[1:].lower() for w in s.split(' ')])

In [0]:
capitalize('a lower case sentence')

# Exploring Data in pandas

## 1. Getting Started 

### Welcome

Welcome to Jupyter, a notebook environment for python. We will use Jupyter notebooks throughout this course to explore, visualize, and analyze data. You can learn more about Jupyter here: http://jupyter.readthedocs.io/en/latest/index.html

### Import required python libraries

We begin by importing a few basic python libraries for data analysis and visualization. We will start with the following:

- pandas: data analysis
- numpy: numerical computing and linear algebra in python
- datetime: date and time functionality
- matplotlib: plotting and data visualization tools
- seaborn: interface to matplotlib for easier (and prettier) plotting

In [0]:
import pandas as pd   # We give the libraries short names for easier referencing
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline  
# This allows us to display plots right within our Jupyter notebook

### Set notebook display options

Pandas is a python library for data analysis. We will begin by using it here to set display options so that we can see a larger number of rows and columns than the default. You can read more about setting options here: https://pandas.pydata.org/pandas-docs/stable/options.html



In [0]:
# -- What is the default maximum number of columns displayed?
pd.options.display.max_columns

In [0]:
# -- What is the default maximum number of rows displayed?
pd.options.display.max_rows

In [0]:
# -- We can change the default number of rows and columns
pd.options.display.max_columns = 50
pd.options.display.max_rows = 500
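
If you only want to change a display option for a moment, one possible approach is `pd.option_context`, which restores the previous value when the block ends. This is a minimal sketch; the value 5 and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

# Remember the current setting so we can compare afterwards.
before = pd.get_option('display.max_rows')

# Inside the with-block the option is temporarily set to 5.
with pd.option_context('display.max_rows', 5):
    inside = pd.get_option('display.max_rows')

# After the block, pandas restores the earlier value.
after = pd.get_option('display.max_rows')

print(before, inside, after)
```

To go back to the pandas default permanently, `pd.reset_option('display.max_rows')` should work as well.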

In [0]:
# -- Which other kinds of options are available?
dir(pd.options)

### Read documentation in your Jupyter notebook
Jupyter conveniently allows us to read documentation on python functions and other objects within our notebook. This can be done by either preceding or following the name of the object with a '?' (question mark). We illustrate this here with the 'set_option' function provided by pandas.

In [0]:
pd.set_option?

In [0]:
# -- 
'''
Getting Familiar with the ipython notebook environment
'''

"""
Session 1: Exploring Data in Python

1.	Getting Started
    - working in jupyter
    - display options
    - importing libraries (and what these libraries contain)
2.	Introduction to pandas
    - python brief review (control structures, data structures, functions, lambda)
    - pandas 
        - 
3.	Loading and Summarizing Data
4.	Visualizing Data in pandas
    - 

5.	Cleaning and Tidying Data

Session 2: Visualizing and Simulating Data

1.	Brief Recap of Last Session
    - What did we learn last time?
    - Review 
2.	Data Visualization
- 
3.	Handling Missing Data
4.	Random Numbers and Sampling
5.	Simulation for Data Analysis

"""

## 2. Introduction to pandas

For all this information and more, see https://pandas.pydata.org/pandas-docs/stable/10min.html

In this class, we'll be making extensive use of a Python library called Pandas. You may be familiar with native python data structures like lists and dictionaries. Pandas offers two data structures called DataFrames and Series which have a lot of built-in functionality that makes analyzing data easy.   

Let's start off by introducing the Series. Series objects are similar in flavor to dictionaries. They have an 'index' that you can think of as the key of a dictionary, which maps to values. In fact, you can create a Series from a dictionary:

In [0]:
underlying_dict = {i:2*i for i in range(10)} # This is called a 'dictionary comprehension'!
sers = pd.Series(underlying_dict)
sers  

In [0]:
# Access entries by .loc:
sers.loc[3]

The index need not be an integer. For example:

In [0]:
underlying_dict = {'a':0, 'b':1, 'c':2}
pd.Series(underlying_dict).loc['a']

In [0]:
# You can also make a Series from a list if you use Pandas' 
# default integer index:
pd.Series([2*i for i in range(10)])
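
Besides `.loc`, which looks entries up by index label, a Series also has `.iloc`, which looks them up by integer position. The sketch below contrasts the two on a small made-up Series (the values 10, 20, 30 are illustrative only).

```python
import pandas as pd

# A small Series with string labels as its index.
s = pd.Series({'a': 10, 'b': 20, 'c': 30})

by_label = s.loc['b']      # look up by index label
by_position = s.iloc[0]    # look up by integer position (first entry)

# Label slices with .loc include both endpoints...
label_slice = s.loc['a':'b']
# ...while position slices with .iloc exclude the end, like Python lists.
position_slice = s.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))
```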

You can think of a DataFrame as a 2D array or matrix. Like a Series, a DataFrame has an index, but a DataFrame also has multiple columns. We can build up a DataFrame as a dictionary of multiple Series:

In [0]:
sers_one = pd.Series({'a':0,'b':1,'c':2})
sers_two = pd.Series({'a':3,'b':4,'c':5})
my_first_df = pd.DataFrame({'sers_one':sers_one, 'sers_two':sers_two})
my_first_df

In [0]:
# Access a column this way:
display(my_first_df['sers_one'])

# Access a row this way:
display(my_first_df.loc['a'])
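
Row and column selection can also be combined in a single `.loc` call. The sketch below rebuilds the same small DataFrame under the hypothetical name `df` and picks out one cell and a sub-table.

```python
import pandas as pd

# Same data as my_first_df above, rebuilt so this cell runs on its own.
df = pd.DataFrame({
    'sers_one': pd.Series({'a': 0, 'b': 1, 'c': 2}),
    'sers_two': pd.Series({'a': 3, 'b': 4, 'c': 5}),
})

# .loc[row_label, column_label] returns a single value.
cell = df.loc['b', 'sers_two']

# Lists of labels return a smaller DataFrame.
subset = df.loc[['a', 'c'], ['sers_one']]

print(cell)
print(subset)
```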

In [0]:
# We can build up a DataFrame much faster than that. For instance, 
# check the output of the following: np.random.normal(size=(10,3))
# It returns a 2D array of normal random variables. 

my_second_df = pd.DataFrame(
    np.random.normal(size=(10,3)),columns = ['a','b','c'])
my_second_df

Let's say we have a function that takes in rows of a DataFrame and spits out a number. We may want to apply this function to each row of our DataFrame and store the results in a new Series. Below, let's make up a simple function, and use a for-loop to accomplish this:

In [0]:
def f(row):
    return 2.*row['a']-row['b']*(row['c'])

In [0]:
apply_f_to_rows = {}
for i,row in my_second_df.iterrows():
    apply_f_to_rows[i] = f(row)
apply_f_to_rows = pd.Series(apply_f_to_rows)

We can accomplish the above with a single line of code using df.apply:

In [0]:
# Below, we set axis = 1 to specify that we want the function to be applied to each row.
# If you come up with a function that can take columns, you should set axis = 0
apply_f_to_rows_faster = my_second_df.apply(f,axis = 1)

# Test that the two results are the same!
# Hint: remember that to test if two values are the same we use ==
# Try np.all([True,True,False]) for a counter example:
np.all(apply_f_to_rows_faster == apply_f_to_rows)

In [0]:
my_second_df['d'] = 2 * my_second_df['a'] - my_second_df['b'] * my_second_df['c']
my_second_df
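
The column arithmetic in the cell above computes the same quantity as `f` without looping over rows. As a hedged sketch, the example below checks that the two approaches agree, using `np.allclose` rather than `==` because floating-point results are safer to compare with a tolerance. The seed and variable names are assumptions made so the example is repeatable.

```python
import numpy as np
import pandas as pd

# Seeded random data so the example gives the same numbers every run.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 3)), columns=['a', 'b', 'c'])

# Row by row: the function is called once per row.
row_wise = df.apply(lambda row: 2.0 * row['a'] - row['b'] * row['c'], axis=1)

# Vectorized: whole columns are combined at once.
vectorized = 2.0 * df['a'] - df['b'] * df['c']

# Compare with a small tolerance instead of exact equality.
same = np.allclose(row_wise, vectorized)
print(same)
```

On large DataFrames the vectorized form is usually much faster, since the arithmetic runs on whole columns instead of one row at a time.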

## 3. Loading and Summarizing Data  

### Load data

Data can be loaded into a pandas data frame from a variety of file formats. In this session, we will load data on about 5000 movies from the Internet Movie Database (IMDB), stored in a CSV file. To see the first few rows of a data frame, use df.head() as in the example below.

This dataset is available online on Kaggle: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
There are many other datasets available from Kaggle.

In [0]:
!git clone https://<your username>:<your password>@github.com/haroldfox/ts-stuy-2019

In [0]:
!ls

In [0]:
movies = pd.read_csv('ts-stuy-2019/datasets/movie_metadata.csv')

In [0]:
movies.head(10)  # Try changing the number of rows and removing the argument to check what pandas uses as the default value

In [0]:
movies.head(10).T

We can view the dimensions of a data frame using df.shape, which returns a (rows, columns) tuple. For example, here we see that our movies data has 5043 rows and 28 columns.

In [0]:
movies.shape

We can see the names of the columns in a data frame using df.columns.

In [0]:
movies.columns

In [0]:
movies.groupby('actor_2_name').count()['movie_title'].sort_values(ascending=False).head(20)
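
Counting rows per group, as in the cell above, can also be written with `value_counts`. The sketch below compares the two on a tiny made-up table (the titles and actor names are placeholders, not rows from the real dataset).

```python
import pandas as pd

# Toy data standing in for the movies table; one actor name is missing.
toy = pd.DataFrame({
    'movie_title': ['A', 'B', 'C', 'D'],
    'actor_2_name': ['X', 'Y', 'X', None],
})

# Group by actor, count titles in each group, and sort from most to least.
by_groupby = (toy.groupby('actor_2_name')['movie_title']
                 .count()
                 .sort_values(ascending=False))

# value_counts does the counting and sorting in one step.
by_value_counts = toy['actor_2_name'].value_counts()

print(by_groupby)
print(by_value_counts)
```

Both versions skip rows where the actor name is missing. They can differ if `movie_title` itself has missing values, because `count()` only counts non-missing entries in the selected column.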

In [0]:
movies[movies['actor_2_name'] == 'Jason Flemyng'][['movie_title', 'director_name', 'title_year']]

When in doubt about the correct way to call a function or use a python object, we can access documentation on it from right within the notebook by following its name with a '?' (question mark). The line below retrieves documentation on the 'read_csv' function that we used above to load data from a CSV file.

In [0]:
pd.read_csv?

### Display summary and descriptive statistics
We can summarize the contents of a data frame using the df.describe() method.


In [0]:
movies.describe()

In [0]:
movies.sort_values('gross', ascending=False).head(30)[['movie_title', 'gross']]

### Select rows and columns using .loc
We can select rows and columns of a data frame using .loc, a versatile indexer that allows us to specify rows and columns either by label or by a condition that must be true. Which of the movies in the dataset grossed over $400 million?

1. We select the movie_title and gross columns for all movies that grossed over $400 million.
2. We sort the selected data by gross amount (descending) to rank the highest-grossing movies.

In [0]:
movies.loc[movies.gross > 400e6, ['movie_title', 'gross']].sort_values('gross', ascending=False)
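
To see how filtering by a condition works step by step, here is a sketch on a small made-up table. The column names match the movies data, but the rows are invented for illustration.

```python
import pandas as pd

# Invented rows that reuse the real column names.
toy = pd.DataFrame({
    'movie_title': ['Small Film', 'Big Hit', 'Bigger Hit'],
    'gross': [5e6, 450e6, 700e6],
    'title_year': [2001, 2010, 2015],
})

# A condition produces a True/False value for every row.
mask = toy['gross'] > 400e6

# .loc[rows, columns] keeps rows where the mask is True and the listed columns.
top = toy.loc[mask, ['movie_title', 'gross']].sort_values('gross', ascending=False)

# Conditions can be combined with & (and) or | (or); each needs parentheses.
both = toy.loc[(toy['gross'] > 400e6) & (toy['title_year'] >= 2012), 'movie_title']

print(top)
print(list(both))
```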

### **TRY THIS!**
Ask some interesting questions about the movie data that can be answered by filtering and sorting or ranking the dataset. Then answer the questions using what you have learned about .loc and .sort_values.

### Group data
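How many movies from each year of release are in the dataset? We can find out easily by first grouping the data by year and then counting the number of rows in each year as shown below. Note that count() reports the number of non-missing values in each column, so the counts can differ from column to column.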

In [0]:
movies.groupby('title_year').count()

### Group on filtered data
If we only want the number of movies per year from 2007 onward, we can first filter the data, then count the number of movies in each group.

In [0]:
movies[(movies.title_year>=2007)].groupby('title_year')['movie_title'].count()
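
One detail worth knowing when counting groups: `count()` skips missing values, while `size()` counts every row. The sketch below shows the difference on invented data with one missing title.

```python
import pandas as pd

# Invented rows; one title is missing on purpose.
toy = pd.DataFrame({
    'movie_title': ['A', 'B', None, 'D', 'E'],
    'title_year': [2006, 2007, 2007, 2008, 2008],
})

# Filter first, then group.
recent = toy[toy['title_year'] >= 2007]

# count() ignores the missing title in 2007.
counted = recent.groupby('title_year')['movie_title'].count()

# size() counts all rows in each group, missing values included.
sized = recent.groupby('title_year').size()

print(counted)
print(sized)
```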

## 4. Visualizing Data in pandas

### Plot data from a data frame 

In [0]:
movies[(movies.title_year>=2007)].groupby('title_year')[['movie_title']].count().plot()

In [0]:
movies[(movies.title_year>=2007)].groupby('title_year')[['movie_title']].count().plot(kind='bar')

In [0]:
movies[(movies.title_year>=2007)].groupby('title_year')[['movie_title']].count().plot(kind='barh')

# Try it

## Python review

Write a function to add a certain number of minutes to a time

In [0]:
def add_minutes(num_minutes, hour, minute):
  # return a time tuple
  return (hour, minute)

In [0]:
assert((4, 0) == add_minutes(80, 2, 40))

## Analyzing IMDB dataset with pandas

Which director with at least 3 films has the highest average gross? Who has the highest average imdb_score?

What are the highest-grossing films by genre? What are the highest imdb_score by genre?

Which actors with at least 3 films have the highest gross? Which ones have the highest imdb_score? You will have to combine the actor_1_name, actor_2_name, and actor_3_name columns together. Hint: you will want to look up concat, rename, and copy.
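
As a starting point for the hint, here is a rough sketch of stacking two name columns into one using `copy`, `rename`, and `concat`. It uses an invented two-row table and only two actor columns, and it is one possible approach rather than the expected solution. The grouping and filtering are left to you.

```python
import pandas as pd

# Invented table with the same kind of columns as the movies data.
toy = pd.DataFrame({
    'movie_title': ['A', 'B'],
    'actor_1_name': ['X', 'Y'],
    'actor_2_name': ['Y', 'Z'],
    'gross': [10.0, 30.0],
})

parts = []
for col in ['actor_1_name', 'actor_2_name']:
    # Copy the columns we need so the original table is left untouched.
    part = toy[['movie_title', 'gross', col]].copy()
    # Give every piece the same column name so they line up when stacked.
    part = part.rename(columns={col: 'actor'})
    parts.append(part)

# Stack the pieces on top of each other: one row per (movie, actor) pair.
long_form = pd.concat(parts, ignore_index=True)
print(long_form)
```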

Movie gross does not adjust for inflation. Calculate the average gross per movie per year. Calculate adjusted_gross as the ratio between gross and the median yearly gross. Which movies have the highest adjusted_gross?