# Intro to dataframes, functions, control flow

[Anaconda Install](https://www.anaconda.com/download/success)

# Contents:
1. Dataframes
2. Importing a csv file into a dataframe
3. Working with dataframes
4. Functions

# Part 1: Dataframes

In Python we store tables (the fundamental object in data science) in what are called dataframes. Conceptually, a dataframe behaves like a dictionary of arrays: each column name maps to an array of values, and all columns have the same length. We need to import a package called "pandas" in order to construct dataframes. Pandas is a data science package that we will use a lot in this course.

[Pandas Documentation](https://pandas.pydata.org/docs/)

In [1]:
import pandas as pd

First, let us create a dictionary of lists and use it to construct a pandas dataframe. Typically you will just import an existing csv file into a dataframe, but you can also add rows (observations) manually, which provides some insight into the structure of dataframes. Notice how similar a dataframe is to a table in a database.

In [3]:
person = {'name': ['John', 'James', 'Jack'], 'age': [45, 28, 18]}

person_df = pd.DataFrame.from_dict(person) 

person_df

|   | name  | age |
|---|-------|-----|
| 0 | John  | 45  |
| 1 | James | 28  |
| 2 | Jack  | 18  |
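
To illustrate the remark about adding observations manually, here is a minimal sketch. The extra person `Jill` is a hypothetical row invented for the example, and `pd.concat` is one common way to append rows, not necessarily the approach used later in the course.

```python
import pandas as pd

# Same data as in the cell above.
person = {'name': ['John', 'James', 'Jack'], 'age': [45, 28, 18]}
person_df = pd.DataFrame(person)

# The same table built row by row: each dictionary is one observation.
rows_df = pd.DataFrame([
    {'name': 'John', 'age': 45},
    {'name': 'James', 'age': 28},
    {'name': 'Jack', 'age': 18},
])
print(rows_df.equals(person_df))

# Hypothetical extra observation, used only for illustration.
new_row = pd.DataFrame({'name': ['Jill'], 'age': [33]})

# pd.concat stacks the two tables; ignore_index=True renumbers the rows.
extended_df = pd.concat([person_df, new_row], ignore_index=True)
print(extended_df)
```

Both constructions describe the same table: the dictionary form is organized by column, the list form by row.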


We can access a column of the dataframe just as we would an entry of a dictionary:

In [6]:
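# astype returns a new column; the result is not stored here
person_df['name'].astype(str)
# each entry of the 'name' column is a Python string
type(person_df['name'].loc[0])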

str
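
As a small sketch of the dictionary-style access described above, the following compares a few common ways of selecting data. The variable names (`name_column`, `name_table`, `older`) and the age cutoff are illustrative choices, not part of the original notebook.

```python
import pandas as pd

person_df = pd.DataFrame({'name': ['John', 'James', 'Jack'],
                          'age': [45, 28, 18]})

# Single brackets return one column as a pandas Series.
name_column = person_df['name']

# Double brackets return a dataframe that contains only that column.
name_table = person_df[['name']]

# .loc selects by row label and column name.
first_name = person_df.loc[0, 'name']

# A boolean condition on a column selects the matching rows.
older = person_df[person_df['age'] > 20]

print(type(name_column).__name__, type(name_table).__name__)
print(first_name)
print(older)
```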

It is straightforward to add columns to a dataframe with similar code. For instance, we can add a column called location.

In [7]:
person_df['location'] = ['Pittsburgh', 'Pittsburgh', 'Philadelphia'] 

person_df

|   | name  | age | location     |
|---|-------|-----|--------------|
| 0 | John  | 45  | Pittsburgh   |
| 1 | James | 28  | Pittsburgh   |
| 2 | Jack  | 18  | Philadelphia |
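
A new column does not have to be typed in by hand. The sketch below shows two other options, a single value repeated for every row and a column computed from an existing one. The column names `country`, `age_in_months`, and `too_short` are hypothetical and exist only for this illustration.

```python
import pandas as pd

person_df = pd.DataFrame({'name': ['John', 'James', 'Jack'],
                          'age': [45, 28, 18]})
person_df['location'] = ['Pittsburgh', 'Pittsburgh', 'Philadelphia']

# A single value is repeated for every row.
person_df['country'] = 'USA'

# A column can be computed from an existing column.
person_df['age_in_months'] = person_df['age'] * 12

# A list must have exactly one entry per row; otherwise pandas raises an error.
try:
    person_df['too_short'] = ['a', 'b']
except ValueError as error:
    print('Length mismatch:', error)

print(person_df)
```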


# Part 2: Importing csv files into a dataframe 
First you must upload the csv file into the folder that your Jupyter notebook is in. You can do this by:
- Clicking on the Jupyter icon in the top left corner of this page
- Then clicking on the upload button in the top right corner of the page you land on

Go ahead and upload the "twitter.csv" file posted on Canvas so that we can import it. We import the twitter csv file into a dataframe called twitter with the following code.

In [3]:
twitter = pd.read_csv("twitter.csv")

twitter['author'].unique()

array(['katyperry', 'justinbieber', 'taylorswift13', 'BarackObama',
       'rihanna', 'YouTube', 'ladygaga', 'TheEllenShow', 'Twitter',
       'jtimberlake', 'KimKardashian', 'britneyspears', 'Cristiano',
       'selenagomez', 'cnnbrk', 'jimmyfallon', 'ArianaGrande', 'shakira',
       'instagram', 'ddlovato'], dtype=object)
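
One way to check the size and layout of a freshly imported table is with `.shape`, `.columns`, and `.head()`. The sketch below uses a three-row stand-in written inline (the values are illustrative, not the real file) so that it runs without `twitter.csv`; with the real file you would call the same attributes on `twitter`.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for twitter.csv, for illustration only.
csv_text = """author,content,number_of_likes
katyperry,Life goals.,10341
ddlovato,Life could not be better.,32799
katyperry,Me right now,10774
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# .shape is (number of rows, number of columns).
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# Column names and the first two rows.
print(sample.columns.tolist())
print(sample.head(2))
```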

We see that the table has 52542 observations and 10 variables. It is always a good idea to take a quick look at the data to see what we are working with and whether it makes sense. The following command provides summary statistics for each numeric variable, which are often helpful for this purpose.

In [60]:
twitter.describe()

|       | id           | latitude  | longitude   | number_of_likes | number_of_shares |
|-------|--------------|-----------|-------------|-----------------|------------------|
| count | 52542.0      | 1.0       | 1.0         | 52542.0         | 52542.0          |
| mean  | 5.741141e+17 | 37.776973 | -122.416523 | 9637.838339     | 5386.880857      |
| std   | 2.009723e+17 | NaN       | NaN         | 18759.083482    | 11517.259484     |
| min   | 6789717000.0 | 37.776973 | -122.416523 | 0.0             | 0.0              |
| 25%   | 4.485852e+17 | 37.776973 | -122.416523 | 916.0           | 378.0            |
| 50%   | 6.337935e+17 | 37.776973 | -122.416523 | 2595.5          | 1266.0           |
| 75%   | 7.336588e+17 | 37.776973 | -122.416523 | 10300.75        | 5205.0           |
| max   | 8.2372e+17   | 37.776973 | -122.416523 | 429159.0        | 219062.0         |
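
Two things stand out in this summary: text columns such as author are absent, and latitude has a count of 1, which means almost every row is missing a value there. The sketch below reproduces both effects on a small made-up table (the three rows are illustrative, not taken from the file) and shows a few calls that may help when investigating them.

```python
import pandas as pd

# Small illustrative table; None marks a missing latitude.
sample = pd.DataFrame({
    'author': ['katyperry', 'katyperry', 'ddlovato'],
    'latitude': [37.776973, None, None],
    'number_of_likes': [7900, 3689, 32799],
})

# By default describe() summarizes only the numeric columns.
numeric_summary = sample.describe()

# include='all' adds the text columns (unique values, most frequent value).
all_summary = sample.describe(include='all')

# Number of missing values in each column.
missing = sample.isna().sum()

# How often each author appears.
counts = sample['author'].value_counts()

print(numeric_summary)
print(missing)
print(counts)
```

The count row reports non-missing values only, which is why a column with many gaps shows a count far below the number of rows.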


# Part 3: Working with dataframes
1. Checking/changing variable types
2. Creating new variables

Recall that raw data is messy, and it is usually not in the format we would like. Consider the variable date_time: 

In [113]:
twitter['date_time']

0        12/01/2017 19:52
1        11/01/2017 08:38
2        11/01/2017 02:52
3        11/01/2017 02:44
4        10/01/2017 05:22
               ...       
52537    06/01/2015 23:10
52538    06/01/2015 02:17
52539    05/01/2015 03:42
52540    05/01/2015 00:06
52541    05/01/2015 00:02
Name: date_time, Length: 52542, dtype: object

It is being read as an "object" that contains the date and time the tweet was posted. The object type is a generic type that pandas assigns to a column when it cannot infer a more specific type; in practice it usually means the column holds text. Pandas has a dedicated date and time type, so let's convert the column to that.

In [118]:
twitter['date_time'] = pd.to_datetime(twitter['date_time'])

In [115]:
twitter['date_time']

0       2017-12-01 19:52:00
1       2017-11-01 08:38:00
2       2017-11-01 02:52:00
3       2017-11-01 02:44:00
4       2017-10-01 05:22:00
                ...        
52537   2015-06-01 23:10:00
52538   2015-06-01 02:17:00
52539   2015-05-01 03:42:00
52540   2015-05-01 00:06:00
52541   2015-05-01 00:02:00
Name: date_time, Length: 52542, dtype: datetime64[ns]

Great, we see that it is being treated as a date variable now. Often we also want to convert object variables to categorical variables. You will do this in lab this week. 
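
A note on the date conversion above: strings such as 12/01/2017 are ambiguous, since they can be read as December 1 or as January 12. The output shown corresponds to the month-first reading. Which reading is intended depends on how the file was produced, which is not stated here, so the sketch below simply shows how to make either choice explicit with the `format` argument of `pd.to_datetime`.

```python
import pandas as pd

# Two date strings in the same layout as the date_time column.
raw = pd.Series(['12/01/2017 19:52', '11/01/2017 08:38'])

# Month first: December 1 and November 1.
month_first = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')

# Day first: January 12 and January 11.
day_first = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

print(month_first.dt.month.tolist())
print(day_first.dt.month.tolist())
```

Passing an explicit format also makes pandas raise an error when a string does not match, instead of guessing.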

Note that you can check the types of all of the variables with the code below. (The output already lists likes_divby_shares, a column that is created in the next step.)

In [123]:
twitter.dtypes

author                        object
content                       object
country                       object
date_time             datetime64[ns]
id                           float64
language                      object
latitude                     float64
longitude                    float64
number_of_likes                int64
number_of_shares               int64
likes_divby_shares           float64
dtype: object

We can create new variables from existing ones using what we already know:

In [124]:
twitter['likes_divby_shares'] = twitter['number_of_likes']/twitter['number_of_shares']

We have created a new variable likes_divby_shares that equals the number of times each tweet is liked divided by the number of times it is shared. Because number_of_likes and number_of_shares are arrays (of the appropriate type), we can use vector operations to create a new column. Let's take a look:

In [125]:
twitter

|       | author    | content                                            | country | date_time           | id           | language | latitude | longitude | number_of_likes | number_of_shares | likes_divby_shares |
|-------|-----------|----------------------------------------------------|---------|---------------------|--------------|----------|----------|-----------|-----------------|------------------|--------------------|
| 0     | katyperry | Is history repeating itself...?#DONTNORMALIZEH...  | NaN     | 2017-12-01 19:52:00 | 8.196330e+17 | en       | NaN      | NaN       | 7900            | 3472             | 2.275346           |
| 1     | katyperry | @barackobama Thank you for your incredible gra...  | NaN     | 2017-11-01 08:38:00 | 8.191010e+17 | en       | NaN      | NaN       | 3689            | 1380             | 2.673188           |
| 2     | katyperry | Life goals. https://t.co/XIn1qKMKQl                | NaN     | 2017-11-01 02:52:00 | 8.190140e+17 | en       | NaN      | NaN       | 10341           | 2387             | 4.332216           |
| 3     | katyperry | Me right now 🙏🏻 https://t.co/gW55C1wrwd            | NaN     | 2017-11-01 02:44:00 | 8.190120e+17 | en       | NaN      | NaN       | 10774           | 2458             | 4.383238           |
| 4     | katyperry | SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...   | NaN     | 2017-10-01 05:22:00 | 8.186890e+17 | en       | NaN      | NaN       | 17620           | 4655             | 3.785177           |
| ...   | ...       | ...                                                | ...     | ...                 | ...          | ...      | ...      | ...       | ...             | ...              | ...                |
| 52537 | ddlovato  | Life couldn't be better right now. 😊               | NaN     | 2015-06-01 23:10:00 | 5.526030e+17 | en       | NaN      | NaN       | 32799           | 23796            | 1.378341           |
| 52538 | ddlovato  | First Monday back in action. I'd say 21.6 mile...  | NaN     | 2015-06-01 02:17:00 | 5.522880e+17 | en       | NaN      | NaN       | 21709           | 12511            | 1.735193           |
| 52539 | ddlovato  | Crime shows, buddy, snuggles = the perfect Sun...  | NaN     | 2015-05-01 03:42:00 | 5.519470e+17 | en       | NaN      | NaN       | 25269           | 15583            | 1.621575           |
| 52540 | ddlovato  | ❄️ http://t.co/sHCFdPpGPa                           | NaN     | 2015-05-01 00:06:00 | 5.518920e+17 | und      | NaN      | NaN       | 15985           | 10456            | 1.528787           |
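
One caveat with the ratio: the summary statistics earlier show that number_of_shares has a minimum of 0, so some divisions have a zero denominator. The sketch below, on three made-up rows, shows what pandas returns in that case and one possible way to mark those results as missing. Replacing infinities with NaN is an illustrative choice, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Three illustrative rows; the last two have zero shares.
sample = pd.DataFrame({
    'number_of_likes': [7900, 10, 0],
    'number_of_shares': [3472, 0, 0],
})

# Element-wise division: pandas does not raise an error on zero denominators.
ratio = sample['number_of_likes'] / sample['number_of_shares']
print(ratio)

# A positive number divided by zero gives inf; zero divided by zero gives NaN.
# Here both are treated as missing.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
print(cleaned)
```

Statistics such as the mean skip NaN by default but do not skip inf, so a step like this can matter before summarizing the new column.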


In [128]:
# We can also compute the mean of variables with code like: 

twitter['number_of_likes'].mean()

9637.838338852727
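
The mean is one of several built-in summaries, and summaries can also be computed separately for each group. The following is a minimal sketch on four made-up rows; `groupby` is not used elsewhere in this notebook and is shown only as one natural next step.

```python
import pandas as pd

# Four illustrative rows, two per author.
sample = pd.DataFrame({
    'author': ['katyperry', 'katyperry', 'ddlovato', 'ddlovato'],
    'number_of_likes': [7900, 3689, 32799, 21709],
})

# Summaries over the whole column.
overall_mean = sample['number_of_likes'].mean()
overall_max = sample['number_of_likes'].max()

# The same summary computed separately for each author.
mean_by_author = sample.groupby('author')['number_of_likes'].mean()

print(overall_mean, overall_max)
print(mean_by_author)
```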

# Part 4: Functions


In [9]:
twitter["length"] = twitter["content"].str.len()

In [14]:
sum(twitter["length"])

283037494
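
The cell above uses a built-in string method, `.str.len()`. When no built-in method does what you need, you can write your own function and apply it to every entry of a column. The sketch below assumes a hypothetical helper, `count_hashtags`, and three made-up tweets; neither comes from the original notebook.

```python
import pandas as pd

def count_hashtags(text):
    """Return the number of whitespace-separated words that start with '#'."""
    return sum(word.startswith('#') for word in text.split())

# Three illustrative tweets.
sample = pd.DataFrame({
    'content': ['Life goals. #blessed', 'No tags here', '#one #two #three'],
})

# Built-in string method, as in the cell above.
sample['length'] = sample['content'].str.len()

# apply calls our own function once per entry of the column.
sample['hashtags'] = sample['content'].apply(count_hashtags)

print(sample)
print(sample['length'].sum())
```

The column method `.sum()` gives the same total as the built-in `sum(...)` used above when no values are missing.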