# PyCon 2019: Data Science Best Practices with pandas ([video](https://www.youtube.com/watch?v=dPwLlJkSHLo&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=36))

### GitHub repository: https://github.com/justmarkham/pycon-2019-tutorial

### Instructor: Kevin Markham

- Website: https://www.dataschool.io
- YouTube: https://www.youtube.com/dataschool
- Patreon: https://www.patreon.com/dataschool
- Twitter: https://twitter.com/justmarkham
- GitHub: https://github.com/justmarkham

## 1. Introduction to the TED Talks dataset

https://www.kaggle.com/rounakbanik/ted-talks

## 2. Which talks provoke the most online discussion?

In [None]:
# sort by the number of first-level comments, though this is biased in favor of older talks

In [None]:
# correct for this bias by calculating the number of comments per view

In [None]:
# interpretation: for every view of the same-sex marriage talk, there are about 0.002 comments

In [None]:
# make this more interpretable by inverting the calculation

In [None]:
# interpretation: roughly 1 comment is left for every 450 views
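
A minimal sketch of the two calculations above. The DataFrame name `ted`, the column names `comments` and `views`, and all values are assumptions made for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the TED dataset (values are invented)
ted = pd.DataFrame({
    'name': ['Talk A', 'Talk B', 'Talk C'],
    'comments': [100, 400, 50],
    'views': [100_000, 180_000, 200_000],
})

# Raw comment counts favor talks that have been online longer
by_comments = ted.sort_values('comments', ascending=False)

# Normalize by views to correct for that bias
ted['comments_per_view'] = ted['comments'] / ted['views']

# Invert the ratio so it reads as "one comment for every N views"
ted['views_per_comment'] = ted['views'] / ted['comments']

# The smallest views_per_comment marks the most-discussed talk
most_discussed = ted.sort_values('views_per_comment')
print(most_discussed[['name', 'comments_per_view', 'views_per_comment']])
```

The inverted ratio carries the same information as the original one, but whole numbers such as 450 are easier to explain than small decimals.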

Lessons:

1. Consider the limitations and biases of your data when analyzing it
2. Make your results understandable

## 3. Visualize the distribution of comments

In [None]:
# line plot is not appropriate here (use it to measure something over time)

In [None]:
# histogram shows the frequency distribution of a single numeric variable

In [None]:
# modify the plot to be more informative

In [None]:
# check how many observations we removed from the plot

In [None]:
# can also write this using the query method

In [None]:
# can also write this using the loc accessor
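
A sketch of the three equivalent ways to select the observations left out of the plot. The cutoff of 1000 comments, the `ted` DataFrame, and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical comment counts (invented values)
ted = pd.DataFrame({'comments': [12, 85, 240, 990, 1500, 6404]})

# Boolean mask: rows that an assumed cutoff of 1000 comments would exclude
mask_result = ted[ted.comments >= 1000]

# Same filter written with the query method
query_result = ted.query('comments >= 1000')

# Same filter with the loc accessor, which can also select a column
loc_result = ted.loc[ted.comments >= 1000, 'comments']

# Number of observations removed from the plot
removed = mask_result.shape[0]
print(removed)

# The filtered histogram itself would be a one-liner (requires matplotlib):
# ted[ted.comments < 1000].comments.plot(kind='hist', bins=20)
```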

In [None]:
# increase the number of bins to see more detail

In [None]:
# boxplot can also show distributions, but it's far less useful for concentrated distributions because of outliers

Lessons:

1. Choose your plot type based on the question you are answering and the data type(s) you are working with
2. Use pandas one-liners to iterate through plots quickly
3. Try modifying the plot defaults
4. Creating plots involves decision-making

## 4. Plot the number of talks that took place each year

Bonus exercise: calculate the average delay between filming and publishing

In [None]:
# event column does not always include the year

In [None]:
# dataset documentation for film_date says "Unix timestamp of the filming"

In [None]:
# results don't look right

[pandas documentation for `to_datetime`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)

In [None]:
# now the results look right
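
A sketch of why the first attempt goes wrong, assuming `film_date` holds Unix timestamps in seconds as the dataset documentation says. Without a `unit` argument, `to_datetime` treats integers as nanoseconds since the epoch. The two sample timestamps are chosen for illustration.

```python
import pandas as pd

# Unix timestamps in seconds (sample values chosen for illustration)
film_date = pd.Series([1140825600, 1172361600])

# Default behavior: integers are read as nanoseconds, so every date lands in 1970
wrong = pd.to_datetime(film_date)

# Declaring the unit gives the intended dates
right = pd.to_datetime(film_date, unit='s')

print(wrong.dt.year.tolist())
print(right.dt.year.tolist())
```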

In [None]:
# verify that event name matches film_datetime for a random sample

In [None]:
# new column uses the datetime data type (this was an automatic conversion)

In [None]:
# datetime columns have convenient attributes under the dt namespace

In [None]:
# similar to string methods under the str namespace

In [None]:
# count the number of talks each year using value_counts()

In [None]:
# points are plotted and connected in the order you give them to pandas

In [None]:
# need to sort the index before plotting

In [None]:
# we only have partial data for 2017
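
A sketch of the counting and sorting steps, using a small hypothetical `film_datetime` Series with invented dates. `value_counts()` orders its result by frequency, which is why a line plot of the raw result zigzags; sorting the index restores chronological order.

```python
import pandas as pd

# Hypothetical filming dates (invented values)
film_datetime = pd.Series(pd.to_datetime([
    '2012-06-26', '2006-02-25', '2012-03-01',
    '2009-07-21', '2012-11-08', '2009-02-05',
]))

# Count talks per year; the result is ordered by count, not by year
counts = film_datetime.dt.year.value_counts()

# Sort by the index (the year) before plotting
ordered = counts.sort_index()
print(ordered)

# ordered.plot() would now draw the line chronologically (requires matplotlib)
```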

Lessons:

1. Read the documentation
2. Use the datetime data type for dates and times
3. Check your work as you go
4. Consider excluding data if it might not be relevant

## 5. What were the "best" events in TED history to attend?

In [None]:
# count the number of talks (great if you value variety, but they may not be great talks)

In [None]:
# use views as a proxy for "quality of talk"

In [None]:
# find the largest values, but we don't know how many talks are being averaged

In [None]:
# show the number of talks along with the mean (events with the highest means had only 1 or 2 talks)

In [None]:
# calculate the total views per event
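
A sketch of showing the count, mean, and total side by side, so that a high mean backed by a single talk is easy to spot. The `ted` DataFrame, the `views` column name, the event names, and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical events and view counts (invented values)
ted = pd.DataFrame({
    'event': ['TED2006', 'TED2006', 'TED2006', 'TEDxSmall', 'TED2012', 'TED2012'],
    'views': [1_000_000, 2_000_000, 3_000_000, 9_000_000, 4_000_000, 6_000_000],
})

# One row per event: number of talks, mean views, and total views
summary = ted.groupby('event')['views'].agg(['count', 'mean', 'sum'])

# Sorting by total views rewards events with many well-watched talks
print(summary.sort_values('sum'))
```

In this toy data the event with the highest mean has only one talk, which is the small-sample problem the cells above describe.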

Lessons:

1. Think creatively about how you can use the data you have to answer your question
2. Watch out for small sample sizes

## 6. Unpack the ratings data

In [None]:
# previously, users could tag talks on the TED website (funny, inspiring, confusing, etc.)

In [None]:
# two ways to examine the ratings data for the first talk

In [None]:
# this is a string, not a list

In [None]:
# convert this into something useful using Python's ast module (Abstract Syntax Tree)

In [None]:
# literal_eval() allows you to evaluate a string containing a Python literal or container

In [None]:
# if you have a string representation of something, you can retrieve what it actually represents

In [None]:
# unpack the ratings data for the first talk

In [None]:
# now we have a list (of dictionaries)
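
A sketch of the conversion on a single value. The string below is a hypothetical ratings entry with invented counts, shaped like a list of dictionaries stored as text.

```python
import ast

# Hypothetical ratings entry: it looks like a list, but it is a string
ratings_str = "[{'id': 7, 'name': 'Funny', 'count': 120}, {'id': 1, 'name': 'Beautiful', 'count': 30}]"
print(type(ratings_str))

# literal_eval parses Python literals only, so it does not run arbitrary code
parsed = ast.literal_eval(ratings_str)
print(type(parsed))

# The result is a real list of dictionaries that can be indexed
print(parsed[0]['name'], parsed[0]['count'])
```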

In [None]:
# define a function to convert an element in the ratings Series from string to list

In [None]:
# test the function

In [None]:
# Series apply method applies a function to every element in a Series and returns a Series

In [None]:
# lambda is a shorter alternative

In [None]:
# an even shorter alternative is to apply the function directly (without lambda)

In [None]:
# check that the new Series looks as expected

In [None]:
# each element in the Series is a list

In [None]:
# data type of the new Series is object

In [None]:
# object is not just for strings
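
A sketch of the three equivalent `apply` calls. The function name `str_to_list` and the sample `ratings` Series are hypothetical, and the counts are invented.

```python
import ast
import pandas as pd

# Hypothetical ratings column: each element is a string, not a list
ratings = pd.Series([
    "[{'name': 'Funny', 'count': 120}, {'name': 'Inspiring', 'count': 30}]",
    "[{'name': 'Inspiring', 'count': 8}, {'name': 'Funny', 'count': 2}]",
])

def str_to_list(ratings_str):
    """Convert one ratings string into a list of dictionaries."""
    return ast.literal_eval(ratings_str)

# Named function
via_function = ratings.apply(str_to_list)

# Lambda wrapping the same call
via_lambda = ratings.apply(lambda x: ast.literal_eval(x))

# Passing the function directly, without lambda
via_direct = ratings.apply(ast.literal_eval)

# Each element is now a list, stored in a Series of dtype object
print(type(via_direct[0]), via_direct.dtype)
```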

Lessons:

1. Pay attention to data types in pandas
2. Use apply whenever no built-in pandas method does the job

## 7. Count the total number of ratings received by each talk

Bonus exercises:

- for each talk, calculate the percentage of ratings that were negative
- for each talk, calculate the average number of ratings it received per day since it was published

In [None]:
# expected result (for each talk) is the sum of its count values

In [None]:
# start by building a simple function

In [None]:
# pass it a list, and it returns the first element in the list, which is a dictionary

In [None]:
# modify the function to return the vote count

In [None]:
# pass it a list, and it returns a value from the first dictionary in the list

In [None]:
# modify the function to return the sum of the count values

In [None]:
# looks about right

In [None]:
# check with another record

In [None]:
# looks about right

In [None]:
# apply it to every element in the Series

In [None]:
# another alternative is to use a generator expression

In [None]:
# use lambda to apply this method

In [None]:
# another alternative is to use pd.DataFrame()

In [None]:
# use lambda to apply this method

In [None]:
# do one more check
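
A sketch of the three approaches described above: a custom function, a generator expression, and `pd.DataFrame()`. The function name `get_num_ratings` and the sample `ratings_list` Series are hypothetical, with invented counts.

```python
import pandas as pd

# Hypothetical ratings_list column: each element is a list of dictionaries
ratings_list = pd.Series([
    [{'name': 'Funny', 'count': 120}, {'name': 'Inspiring', 'count': 30}],
    [{'name': 'Inspiring', 'count': 8}, {'name': 'Funny', 'count': 2}],
])

def get_num_ratings(list_of_dicts):
    """Sum the 'count' value across all dictionaries for one talk."""
    num = 0
    for d in list_of_dicts:
        num = num + d['count']
    return num

# Custom function applied to every element
via_function = ratings_list.apply(get_num_ratings)

# Generator expression inside a lambda
via_generator = ratings_list.apply(lambda x: sum(d['count'] for d in x))

# Build a small DataFrame per talk and sum its count column
via_dataframe = ratings_list.apply(lambda x: pd.DataFrame(x)['count'].sum())

print(via_function.tolist())
```

All three produce the same totals. Building a DataFrame per row is typically the slowest of the three, so it is shown here mainly for readability.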

Lessons:

1. Write your code in small chunks, and check your work as you go
2. Lambda is best for simple functions

## 8. Which occupations deliver the funniest TED talks on average?

Bonus exercises:

- for each talk, calculate the most frequent rating
- for each talk, clean the occupation data so that there's only one occupation per talk

### Step 1: Count the number of funny ratings

In [None]:
# "Funny" is not always the first dictionary in the list

In [None]:
# check ratings (not ratings_list) to see if "Funny" is always a rating type

In [None]:
# write a custom function

In [None]:
# examine a record in which "Funny" is not the first dictionary

In [None]:
# check that the function works

In [None]:
# apply it to every element in the Series

In [None]:
# check for missing values
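
A sketch of a lookup that does not depend on where "Funny" sits in the list. The function name `get_funny_ratings` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical ratings_list column; "Funny" is not always the first dictionary
ratings_list = pd.Series([
    [{'name': 'Funny', 'count': 120}, {'name': 'Inspiring', 'count': 30}],
    [{'name': 'Inspiring', 'count': 8}, {'name': 'Funny', 'count': 2}],
])

def get_funny_ratings(list_of_dicts):
    """Return the count of the 'Funny' rating, wherever it sits in the list."""
    for d in list_of_dicts:
        if d['name'] == 'Funny':
            return d['count']
    # Implicitly returns None when no 'Funny' entry exists

funny_ratings = ratings_list.apply(get_funny_ratings)

# Any talk without a 'Funny' entry would show up as a missing value
print(funny_ratings.isna().sum())
```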

### Step 2: Calculate the percentage of ratings that are funny

In [None]:
# "gut check" that this calculation makes sense by examining the occupations of the funniest talks

In [None]:
# examine the occupations of the least funny talks
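
A sketch of the rate calculation and the gut check. The column names `funny_ratings`, `num_ratings`, `funny_rate`, and `speaker_occupation` are assumptions, as are all values.

```python
import pandas as pd

# Hypothetical per-talk totals (invented values)
ted = pd.DataFrame({
    'speaker_occupation': ['Comedian', 'Surgeon', 'Writer'],
    'funny_ratings': [800, 5, 200],
    'num_ratings': [1000, 1000, 1000],
})

# Share of each talk's ratings that are "Funny"
ted['funny_rate'] = ted.funny_ratings / ted.num_ratings

# Sort ascending: the funniest talks are at the tail, the least funny at the head
by_rate = ted.sort_values('funny_rate')
print(by_rate.speaker_occupation.tail())
print(by_rate.speaker_occupation.head())
```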

### Step 3: Analyze the funny rate by occupation

In [None]:
# calculate the mean funny rate for each occupation

In [None]:
# however, most of the occupations have a sample size of 1

### Step 4: Focus on occupations that are well-represented in the data

In [None]:
# count how many times each occupation appears

In [None]:
# value_counts() outputs a pandas Series, so we can use pandas to manipulate the output

In [None]:
# show occupations which appear at least 5 times

In [None]:
# save the index of this Series

### Step 5: Re-analyze the funny rate by occupation (for top occupations only)

In [None]:
# filter DataFrame to include only those occupations

In [None]:
# redo the previous groupby
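
A sketch of Steps 3 through 5 end to end, assuming a `speaker_occupation` column and a precomputed `funny_rate` column. All values are invented, and the threshold of 5 follows the cell above.

```python
import pandas as pd

# Hypothetical data: two occupations appear 5 times, one appears twice, one is missing
ted = pd.DataFrame({
    'speaker_occupation': ['Comedian'] * 2 + ['Writer'] * 5 + ['Physicist'] * 5 + [None],
    'funny_rate': [0.5, 0.6,
                   0.1, 0.2, 0.1, 0.2, 0.15,
                   0.02, 0.04, 0.02, 0.04, 0.03,
                   0.9],
})

# value_counts() returns a Series (and drops missing values by default)
occupation_counts = ted.speaker_occupation.value_counts()

# Filter that Series like any other, then keep its index
top_occupations = occupation_counts[occupation_counts >= 5].index

# Keep only talks whose occupation is well represented
ted_top = ted[ted.speaker_occupation.isin(top_occupations)]

# Redo the groupby on the filtered data
result = ted_top.groupby('speaker_occupation').funny_rate.mean().sort_values()
print(result)
```

Note that the talk with a missing occupation drops out of both the counts and the filtered data, which is one way missing data can affect the result.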

Lessons:

1. Check your assumptions about your data
2. Check whether your results are reasonable
3. Take advantage of the fact that pandas operations often output a DataFrame or a Series
4. Watch out for small sample sizes
5. Consider the impact of missing data
6. Data scientists are hilarious