# FUNDAMENTALS OF DATA ANALYSIS WITH PYTHON <br><font color="crimson">DAY 3: SOCIAL SCIENTIFIC COMPUTING WITH PANDAS</font>

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6, 2020

### Course Developers and Instructors 

* Dr. [John McLevey](www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a?challengeId=AQGaFXECVnyVqAAAAW_TLnwJ9VHAlBfinArnfKV6DqlEBpTIolp6O2Bau4MmjzZNgXlHqEIpS5piD4nNjEy0wsqNo-aZGkj57A&submissionId=16582ced-1f90-ec15-cddf-eb876f4fe004), Simon Fraser University (jillianderson8@gmail.com) 

<hr>

### Overview 

This notebook introduces some fundamentals of scientific computing with `Pandas` and `matplotlib`. `Pandas` is an extremely popular Python package for storing, manipulating, and analyzing data in a tabular form, with rows and columns. We will learn how to get data into `pandas`, and then how to perform common data analysis tasks such as selecting columns, filtering rows, and computing descriptive statistics. Then we will learn how to use `matplotlib` for producing high-quality plots for print or the web. We will use it to create a variety of common statistical plots and other visualizations. 

### Plan for the Day

1. [`Pandas` 101](#pandas)
2. [Best practices for `Pandas`](#pandasbp)
3. [Open Work Time](#open)

<hr>

In [130]:
import os
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'svg' # better resolution with vector graphics! 

# `Pandas` 101<a id='pandas'></a>

Quantitative or computational social scientists are used to working with data in tabular form, such as a `dataframe` with variables in the columns and observations in the rows. In Python, the `Pandas` package enables us to organize, manipulate, and analyze data in this familiar way. 

`Pandas` is an extremely popular package in the scientific computing community regardless of the discipline (physics, sociology, neuroscience, history) or sector (academia, government, industry). It was originally developed for time series analysis. It gets its name from **pan**el **da**ta. 

This part of the notebook covers some essential functionality of `Pandas` that you will make heavy use of in most data analyses. Of course, we will not cover *everything* that is possible to do with `Pandas`. As with the previous content, the goal is to build a basic foundation that we can build on throughout the week. We will emphasize the functionality that will take you the furthest in any given data analysis project. 

## Reading Data from Files 

`Pandas` makes it easy to load data from an external file directly into a `DataFrame`, which we will discuss momentarily. It does so using one of many `reader` functions that are part of a suite of `I/O` (input / output, read / write) tools. For some common examples, see the table below. Information on these and other `reader` functions can be found in the `pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), which also describes the parameters for these functions, such as how to specify what sheet you want from an Excel spreadsheet, or whether to write the index to a new `csv` file. 



| Data Description                | Reader          | Writer        |
|:--------------------------------|:----------------|:--------------|
| CSV                             | `read_csv()`   | `to_csv()`   |
| JSON                            | `read_json()`  | `to_json()`  |
| MS Excel and OpenDocument (ODF) | `read_excel()` | `to_excel()` |
| Stata                           | `read_stata()` | `to_stata()` |
| SAS                             | `read_sas()`   | NA            |
| SPSS                            | `read_spss()`  | NA            |
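
As a minimal sketch of how a reader and writer pair up, the example below writes a small `dataframe` to CSV text and reads it back. The toy data and column names are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical toy data, used only to illustrate the reader/writer pairing.
toy = pd.DataFrame({'author': ['a', 'b', 'c'], 'followers': [10, 250, 3]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False leaves the row index out of the output

# Rewind the buffer and read the same text back into a new dataframe.
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip)
```

Passing `index=False` is a common choice when the index is just the default row numbers, since otherwise it is written out as an extra unnamed column.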


To illustrate how these `reader` functions work, we will use the `read_csv()` function. The only *required* argument is that we provide the path to the location of the file on our computer. 

In this case, we will use the ["Three Million Russian Trolls" dataset](https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/), which consists of data on ~3M tweets from Twitter accounts that are known to be part of state-sponsored disinformation campaigns. This particular dataset was collected and coded by Darren Linvill and Patrick Warren, of Clemson University. It includes several variables that were hand coded by Linvill and Warren, the most important of which are classifications of accounts into different types. 

The dataset is stored in 13 different `csv` files. They are stored in a directory called `russian-troll-tweets`, which is inside the `data` directory.

In [6]:
!ls data/russian-troll-tweets

IRAhandle_tweets_10.csv  IRAhandle_tweets_2.csv  IRAhandle_tweets_7.csv
IRAhandle_tweets_11.csv  IRAhandle_tweets_3.csv  IRAhandle_tweets_8.csv
IRAhandle_tweets_12.csv  IRAhandle_tweets_4.csv  IRAhandle_tweets_9.csv
IRAhandle_tweets_13.csv  IRAhandle_tweets_5.csv  README.md
IRAhandle_tweets_1.csv	 IRAhandle_tweets_6.csv


Let's start by loading just one of the files. Later we will see how to read in all 13 and combine them into 1 large dataset. 

In [33]:
df = pd.read_csv('data/russian-troll-tweets/IRAhandle_tweets_1.csv')

By default, `pandas` assumes your data is encoded with `UTF-8`. If you see an encoding error, you can specify a different encoding with the `encoding` argument, such as `latin-1`.
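
To make this concrete, here is a small sketch of what an encoding error and its fix might look like. The file and its contents are hypothetical: we write a tiny Latin-1 file to a temporary folder ourselves so that the example is self-contained.

```python
import os
import tempfile
import pandas as pd

# Hypothetical example file: one row with non-ASCII characters, saved as Latin-1.
path = os.path.join(tempfile.mkdtemp(), 'latin_example.csv')
with open(path, 'w', encoding='latin-1') as f:
    f.write('name,city\nZoë,Köln\n')

try:
    pd.read_csv(path)  # the default UTF-8 decoding cannot read these bytes
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Telling pandas which encoding the file uses resolves the error.
latin_df = pd.read_csv(path, encoding='latin-1')
print(utf8_failed)
print(latin_df)
```

In practice you rarely know the encoding in advance, so trying `latin-1` is a guess rather than a guarantee; check that accented characters look right after loading.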

Once we have our `dataframe`, we can use the `info()` method to see the name of each column, as well as its integer index and datatype. 

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243891 entries, 0 to 243890
Data columns (total 21 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   external_author_id  243891 non-null  int64 
 1   author              243891 non-null  object
 2   content             243891 non-null  object
 3   region              243853 non-null  object
 4   language            243891 non-null  object
 5   publish_date        243891 non-null  object
 6   harvested_date      243891 non-null  object
 7   following           243891 non-null  int64 
 8   followers           243891 non-null  int64 
 9   updates             243891 non-null  int64 
 10  post_type           154592 non-null  object
 11  account_type        243891 non-null  object
 12  retweet             243891 non-null  int64 
 13  account_category    243891 non-null  object
 14  new_june_2018       243891 non-null  int64 
 15  alt_external_id     243891 non-null  int64 
 16  tw

We now have a `dataframe` with 21 variables. The `dataframe` is organized the way we would expect: with variables in the columns and observations in the rows. We can use the `.head()` method to preview the top $n$ rows of the dataset. 

In [14]:
df.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,906000000000000000,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,...,Right,0,RightTroll,0,905874659358453760,914580356430536707,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/914580356430...,,
1,906000000000000000,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,...,Right,0,RightTroll,0,905874659358453760,914621840496189440,http://twitter.com/905874659358453760/statuses...,https://twitter.com/damienwoody/status/9145685...,,
2,906000000000000000,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,10/1/2017 22:50,10/1/2017 22:51,1054,9637,255,...,Right,1,RightTroll,0,905874659358453760,914623490375979008,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/913231923715...,,
3,906000000000000000,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,English,10/1/2017 23:52,10/1/2017 23:52,1062,9642,256,...,Right,0,RightTroll,0,905874659358453760,914639143690555392,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/914639143690...,,
4,906000000000000000,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,English,10/1/2017 2:13,10/1/2017 2:13,1050,9645,246,...,Right,1,RightTroll,0,905874659358453760,914312219952861184,http://twitter.com/905874659358453760/statuses...,https://twitter.com/realDonaldTrump/status/914...,,
5,906000000000000000,10_GOP,"Dan Bongino: ""Nobody trolls liberals better th...",Unknown,English,10/1/2017 2:47,10/1/2017 2:47,1050,9644,247,...,Right,0,RightTroll,0,905874659358453760,914320835325853696,http://twitter.com/905874659358453760/statuses...,https://twitter.com/FoxNews/status/91423949678...,,
6,906000000000000000,10_GOP,🐝🐝🐝 https://t.co/MorL3AQW0z,Unknown,English,10/1/2017 2:48,10/1/2017 2:48,1050,9644,248,...,Right,1,RightTroll,0,905874659358453760,914321156466933760,http://twitter.com/905874659358453760/statuses...,https://twitter.com/Cernovich/status/914314644...,,
7,906000000000000000,10_GOP,'@SenatorMenendez @CarmenYulinCruz Doesn't mat...,Unknown,English,10/1/2017 2:52,10/1/2017 2:53,1050,9644,249,...,Right,0,RightTroll,0,905874659358453760,914322215537119234,http://twitter.com/905874659358453760/statuses...,,,
8,906000000000000000,10_GOP,"As much as I hate promoting CNN article, here ...",Unknown,English,10/1/2017 3:47,10/1/2017 3:47,1050,9646,250,...,Right,0,RightTroll,0,905874659358453760,914335818503933957,http://twitter.com/905874659358453760/statuses...,http://www.cnn.com/2017/09/27/us/puerto-rico-a...,,
9,906000000000000000,10_GOP,After the 'genocide' remark from San Juan Mayo...,Unknown,English,10/1/2017 3:51,10/1/2017 3:51,1050,9646,251,...,Right,0,RightTroll,0,905874659358453760,914336862730375170,http://twitter.com/905874659358453760/statuses...,,,


Alternatively, we could use the `.sample()` method to pull a random sample of $n$ observations, which can be helpful if we don't want the observations we preview to be from the top (`head`) or bottom (`tail`) of the dataset.

In [36]:
df.sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
234174,2944766250,ATLANTA_ONLINE,Nathan Deal bets on Supreme Court expansion ht...,United States,English,1/15/2016 11:31,1/15/2016 11:32,7233,14879,7748,...,local,0,NewsFeed,0,2944766250,687960398293671936,http://twitter.com/Atlanta_Online/statuses/687...,https://twibble.io,http://twib.in/l/7A5XLBaLxpn,
103740,1679279490,AMELIEBALDWIN,"On Thanksgiving Week, Native Americans Are Bei...",United States,English,11/23/2016 19:25,11/23/2016 19:25,2371,2559,18958,...,Right,1,RightTroll,0,1679279490,801506948998893568,http://twitter.com/1679279490/statuses/8015069...,http://m.huffpost.com/us/entry/us_583496a3e4b0...,,
52055,3082043421,ALBERTMORENMORE,Love doesn't just sit there like a stone,United States,English,4/24/2015 16:10,4/24/2015 16:10,142,73,190,...,Right,0,RightTroll,0,3082043421,591635351086567424,http://twitter.com/AlbertMoreNMore/statuses/59...,,,
180696,2514605158,ANNY_DUBI,Космонавт Олег Кононенко с борта МКС научился ...,Unknown,Russian,8/20/2015 19:51,8/20/2015 19:51,135,126,1877,...,Russian,1,NonEnglish,1,2514605158,634452696616038401,http://twitter.com/Anny_dubi/statuses/63445269...,,,
109319,1679279490,AMELIEBALDWIN,In that Boeing speech that might've set off th...,United States,English,12/7/2016 5:48,12/7/2016 5:48,2365,2593,20754,...,Right,1,RightTroll,0,1679279490,806374764629598208,http://twitter.com/1679279490/statuses/8063747...,,,
202915,1686370159,ARCHIEOLIVERS,myDevices launches end-to-end IoT platform htt...,United States,English,10/12/2015 20:23,10/12/2015 20:23,510,212,846,...,Right,1,RightTroll,0,1686370159,653667323295825920,http://twitter.com/ArchieOlivers/statuses/6536...,http://hubs.ly/H01gR0w0,,
115818,1679279490,AMELIEBALDWIN,Nothing is more disturbing than 'serial child ...,United States,English,3/31/2017 18:23,3/31/2017 18:23,2302,2763,34696,...,Right,1,RightTroll,0,1679279490,847877034307145729,http://twitter.com/1679279490/statuses/8478770...,https://twitter.com/RedNationRising/status/847...,,
125982,1679279490,AMELIEBALDWIN,"Because the MSM won't report it, I will. 2nd A...",United States,English,9/18/2016 14:58,9/18/2016 14:58,1479,2228,8489,...,Right,1,RightTroll,0,1679279490,777522095181864960,http://twitter.com/AmelieBaldwin/statuses/7775...,,,
17237,893000000000000000,ABIISSROSB,#abi Look What Liberals Wrote on this Philly S...,United States,English,8/18/2017 17:24,8/18/2017 17:24,1957,896,788,...,Right,0,RightTroll,0,893352343277887488,898596475600478208,http://twitter.com/893352343277887488/statuses...,https://twitter.com/abiissrosb/status/89859647...,http://ift.tt/2xba7qA,
164326,895000000000000000,ANIIANTRS,RT RightlyNews: Trump just clearly and strongl...,Unknown,English,8/13/2017 3:13,8/13/2017 3:14,38,5,622,...,Right,0,RightTroll,0,894845726840283136,896570298400747521,http://twitter.com/894845726840283136/statuses...,https://twitter.com/i/web/status/8964722290761...,,


To load up the full dataset -- which is spread across 13 files -- we can read in each `csv` file and concatenate them all into a single `dataframe`. Note that if your data is contained in a single file, this step would not be necessary. 

In [37]:
data_dir = os.listdir('data/russian-troll-tweets')
data_dir

['README.md',
 'IRAhandle_tweets_10.csv',
 'IRAhandle_tweets_11.csv',
 'IRAhandle_tweets_5.csv',
 '.git',
 'IRAhandle_tweets_7.csv',
 'IRAhandle_tweets_8.csv',
 'IRAhandle_tweets_6.csv',
 'IRAhandle_tweets_9.csv',
 'IRAhandle_tweets_2.csv',
 'IRAhandle_tweets_3.csv',
 'IRAhandle_tweets_1.csv',
 'IRAhandle_tweets_13.csv',
 'IRAhandle_tweets_4.csv',
 'IRAhandle_tweets_12.csv']

In [38]:
# Keep only the CSV files (this drops README.md and the .git directory)
files = [f for f in data_dir if f.endswith('.csv')]
files 

['IRAhandle_tweets_10.csv',
 'IRAhandle_tweets_11.csv',
 'IRAhandle_tweets_5.csv',
 'IRAhandle_tweets_7.csv',
 'IRAhandle_tweets_8.csv',
 'IRAhandle_tweets_6.csv',
 'IRAhandle_tweets_9.csv',
 'IRAhandle_tweets_2.csv',
 'IRAhandle_tweets_3.csv',
 'IRAhandle_tweets_1.csv',
 'IRAhandle_tweets_13.csv',
 'IRAhandle_tweets_4.csv',
 'IRAhandle_tweets_12.csv']

We will overwrite the `df` created earlier. 

In [39]:
# `files` already contains only CSV files, so no second filter is needed here
csv_frames = (pd.read_csv(os.path.join('data/russian-troll-tweets', f), encoding='utf-8', low_memory=False) for f in files)
df = pd.concat(csv_frames)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2946207 entries, 0 to 239349
Data columns (total 21 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   external_author_id  object
 1   author              object
 2   content             object
 3   region              object
 4   language            object
 5   publish_date        object
 6   harvested_date      object
 7   following           int64 
 8   followers           int64 
 9   updates             int64 
 10  post_type           object
 11  account_type        object
 12  retweet             int64 
 13  account_category    object
 14  new_june_2018       int64 
 15  alt_external_id     object
 16  tweet_id            int64 
 17  article_url         object
 18  tco1_step1          object
 19  tco2_step1          object
 20  tco3_step1          object
dtypes: int64(6), object(15)
memory usage: 494.5+ MB
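
Notice in the output above that the index runs from 0 to 239349 even though there are 2946207 entries: each file keeps its own row labels, so labels repeat across files. If you would rather have one continuous index, a possible variation is sketched below. It uses two tiny made-up CSV files in a temporary folder rather than the real data, and `glob` to match file names by extension.

```python
import glob
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the data directory: two tiny CSV files in a temp folder.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({'author': ['a', 'b'], 'followers': [1, 2]}).to_csv(
    os.path.join(tmp_dir, 'part_1.csv'), index=False)
pd.DataFrame({'author': ['c'], 'followers': [3]}).to_csv(
    os.path.join(tmp_dir, 'part_2.csv'), index=False)

# glob matches on the file extension, so non-CSV files would be skipped.
paths = sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))

# ignore_index=True discards each file's own row labels and builds a fresh 0..n-1 index.
combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
print(combined.index.tolist())
```

Whether to reset the index is a design choice: keeping the original labels preserves each row's position in its source file, while a fresh index guarantees that every label is unique.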


In this case, we have two datatypes in our `dataframe`: `object` and `int64`. `Pandas` uses `object` to refer to columns that contain `strings`, or which contain mixed types, such as `strings` and `integers`. In this case, they refer to `strings`. `int64` are integers. In addition to these two data types, `pandas` stores `floats` (`float64`), booleans (True or False), several specialized `datetime` data structures, and categorical variables.  
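
As a rough illustration of these data types, the sketch below builds a small `dataframe` with one column of each kind and prints the datatype `pandas` assigns. The column names and values are invented for the example and are not taken from the troll dataset.

```python
import pandas as pd

# Hypothetical toy data with one column per common datatype.
toy = pd.DataFrame({
    'author': ['a', 'b', 'c'],                         # strings
    'followers': [10, 250, 3],                         # integers
    'share': [0.1, 0.7, 0.2],                          # floats
    'verified': [True, False, False],                  # booleans
    'publish_date': pd.to_datetime(['2017-10-01', '2016-01-15', '2015-04-24']),
    'account_type': pd.Categorical(['Right', 'local', 'Right']),
})

# One datatype per column, stored at the level of the column rather than each value.
print(toy.dtypes)
```

How the string column is labelled can differ between `pandas` versions, so treat the printed names as version-dependent.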

One further thing to note about this dataset: **each row is a tweet from a specific account, but some of the variables describe attributes of the tweeting accounts, not of the tweet itself**. For example, `followers` describes the number of followers that the account had at the time it sent the tweet. This makes sense, because tweets don't have followers, but accounts do. We need to keep this in mind when working with this dataset. 

## Understanding `Pandas` Data Structures 

Now that we have a `dataframe` loaded into memory, we can move on to some interesting data analyses. But first, let's devote a bit of time to clarifying `pandas` data structures. 

### Background Knowledge

> Note: feel free to temporarily skip over this "background knowledge" section if you are feeling overwhelmed with new information. It is useful to know, but it is not *essential* knowledge for using Pandas to analyze data. You will not lose much if you come back to this at some point in the future, when you are more comfortable with basic `pandas` data structures and operations. 

First, some background knowledge. Python is a dynamically typed language. What that means is that you don't need to constantly tell Python what kind of object something is. For example, if you add two numbers together

In [124]:
42 + 8

50

it is not necessary to tell Python that `42` and `8` are integers. Instead, Python stores that metadata in each object. 

When we store data in a list, every element in the list is itself a Python object, containing not only the data (e.g. `42`) but also information about the **type** of data that it is, which in this case is `int`. This is enormously useful in many cases, because we can store objects of different types in a single `list`.

In [41]:
some_data = [42, 8, 'a string']
print(some_data)

[42, 8, 'a string']


In this example, each element in `some_data` also contains information about the type of object it is. As previously mentioned, this is enormously helpful in some contexts, but dramatically slows down computation in other contexts. Data analysis is one example of where, depending on what you are trying to do, dynamic typing can slow things down rather a lot. 

When you are analyzing data, you are almost always working with some collection of elements that are all of the same type, such as integers, floats, strings, or Boolean values. For example, you can't compute the mean and standard deviation of a collection of elements that include both integers and strings. So it follows that data analysis can be made more efficient by working on data structures where information about data types is stored at the level of the collection itself rather than each element in the collection, *provided the data is all of the same type*. 

One of the main tools for doing this in `Python` is the `numpy` package, which is more or less the foundation of all data analysis in `Python`, whether you explicitly use it or not. `numpy` provides data structures for working with `arrays` of data that are a bit like lists except that all elements are of the same type, information about that type is stored at the level of the `array` itself, and each element in the `array` has an explicit integer index. `arrays` can be one dimensional vectors or multi-dimensional matrices. 
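
The contrast between a `list` and an `array` can be sketched in a few lines. The values are arbitrary; the point is only where the type information lives.

```python
import numpy as np

# In a list, every element carries its own type.
some_data = [42, 8, 'a string']
element_types = [type(x).__name__ for x in some_data]
print(element_types)

# In an array, one type is stored once for the whole collection.
numbers = np.array([42, 8, 15])
print(numbers.dtype)
print(numbers.mean())

# Mixing types forces numpy to pick one common type; here everything becomes text.
mixed = np.array(some_data)
print(mixed.dtype.kind)  # 'U' means unicode strings
```

Because `mixed` holds only strings, numeric operations such as `mean()` no longer make sense on it, which is the practical reason to keep columns to a single type.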

Further discussion of `numpy` is beyond the scope of this class. For our purposes here, what you need to know is that `pandas` builds on top of `numpy` and offers an additional set of data structures and methods that are designed explicitly to meet the needs of researchers working with real-world empirical data. Like `numpy`, `pandas` is designed to make scientific computing more efficient, but as we will learn below there are some common pitfalls to avoid that, if you are not careful, can actually make working with `pandas` slow and inefficient. 

## Back to Essential `Pandas`: `Series` and `index`

Each column in a `dataframe` is an object called a `series`. A `series` is a one-dimensional object, such as a vector of numbers. However, that vector is associated with an `index`, which is a vector, or array, of labels. 

For example, the column `followers` in our Russian troll `dataframe` is a `series` of integers (the number of followers the account had when it sent the tweet) and their `index` labels. 

In [125]:
num_followers = df['followers']
type(num_followers)

pandas.core.series.Series

Below, we pull a random sample of 25 values from the `series`. The value on the left is the index label for the observation; the number on the right is the actual data value (number of followers). The index values are sequential in the actual `series`, but they are out of sequence here because we pulled a random sample. 

In [126]:
num_followers.sample(25)

129063      802
226691      180
207474      630
67866      2003
33136       268
150258    20234
222897     2607
21034     16576
123620    19238
164287      318
72063       646
22130       789
100625     2393
184266    27688
67744       800
173869      288
137798       23
33858     25403
120106     1326
247893     2383
159423    12278
118832      224
7475       1665
145268      157
163800     2140
Name: followers, dtype: int64

In most cases, the default `index` for a `series` or `dataframe` is an immutable vector of integers:

In [135]:
num_followers.index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            239340, 239341, 239342, 239343, 239344, 239345, 239346, 239347,
            239348, 239349],
           dtype='int64', length=2946207)

In some cases, such as time series analysis, the `index` might be a `DatetimeIndex` or a `PeriodIndex`, but we will not consider those in this course. If you are working with time series data, the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) provides explanations of how to use these types of `indices`.

We can easily modify an `index` so that it is made up of some other type of value instead, such as `strings`. Surprisingly, `index` values do not need to be unique (technically, they are a `multiset`, or a `set` that is allowed to have repeated elements). This enables us to do some powerful things, but most of the time you should avoid manually changing indexes. 

We can use the `index` to retrieve specific values from a `series` much as we would if we were selecting an element from a `list`, `tuple`, or `array`.
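
A short sketch of index-based retrieval follows. The `series` and its labels are made up, and the label `'b'` is repeated on purpose to show what happens when index values are not unique.

```python
import pandas as pd

# Hypothetical toy series with string labels; 'b' appears twice on purpose.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'b', 'c'])

first_by_position = s.iloc[0]   # position-based, like indexing a list
value_for_a = s.loc['a']        # label-based, returns a single value
values_for_b = s.loc['b']       # a repeated label returns a series of all matches

print(first_by_position, value_for_a)
print(values_for_b.tolist())
print(s.index.is_unique)
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of lookup explicit, which matters most when the labels are themselves integers.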

### Operations on `Series`: Descriptive Statistics

As we will soon see, there are a number of operations we can perform on `series`, such as simple descriptive statistics like mean, median, mode, and standard deviation.

In [138]:
print('Median ', num_followers.median())
print('Mean ', num_followers.mean())
print('Standard Deviation ', num_followers.std())

Median  1274.0
Mean  7055.265491868019
Standard Deviation  14635.939344600854


Because a `series` is built on top of a `numpy` `array`, `numpy` functions work on `series` objects and on the values they return. For example, we can use the `round()` function from `numpy` to round these descriptives to three decimal places. 

In [139]:
print('Median ', np.round(num_followers.median(), 3))
print('Mean ', np.round(num_followers.mean(), 3))
print('Standard Deviation ', np.round(num_followers.std(), 3))

Median  1274.0
Mean  7055.265
Standard Deviation  14635.939


We can also `count` the number of non-missing observations in a `series`

In [136]:
num_followers.count()

2946207

or get an overview of multiple descriptives at once:

In [137]:
num_followers.describe()

count    2.946207e+06
mean     7.055265e+03
std      1.463594e+04
min     -1.000000e+00
25%      3.220000e+02
50%      1.274000e+03
75%      1.085300e+04
max      2.512760e+05
Name: followers, dtype: float64

If our series is categorical, we can also easily compute useful information such as the number of unique categories, the size of each category, and so on. For example, let's look at the `account_type` `series`.

In [94]:
atype = df['account_type']

In [95]:
atype.unique()

array(['Russian', 'Right', 'news', '?', 'Koch', 'Hashtager', 'Left',
       'German', 'Arabic', 'local', 'Italian', 'Ebola ', 'French',
       'Commercial', 'Uzbek', 'Spanish', 'Ukranian', 'ZAPOROSHIA',
       'Portuguese'], dtype=object)

In [96]:
atype.value_counts()

Right         711668
Russian       704917
local         459220
Left          427141
Hashtager     241786
news          139006
Commercial    121904
German         91511
Italian        15680
?              13539
Koch           10894
Arabic          6228
Spanish         1226
French          1117
ZAPOROSHIA       175
Portuguese       118
Ebola             71
Ukranian           4
Uzbek              2
Name: account_type, dtype: int64
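
A few related methods are often useful alongside `unique()` and `value_counts()`. The sketch below uses a small invented `series` rather than the real `account_type` column.

```python
import pandas as pd

# Hypothetical toy series standing in for a categorical column.
atype_toy = pd.Series(['Right', 'Left', 'Right', 'local', 'Right', 'Left'])

n_categories = atype_toy.nunique()                 # number of distinct values
shares = atype_toy.value_counts(normalize=True)    # proportions instead of counts
most_common = atype_toy.mode()[0]                  # most frequent category

print(n_categories)
print(shares.round(2))
print(most_common)
```

Proportions are usually easier to compare across datasets of different sizes than raw counts.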

Later, we will consider some summary statistics for pairs of `series`, such as computing correlations and covariance. 

## DataFrames

We already have our `dataframe` loaded into memory (as `df`), but so far all we have used it for is pulling out individual `series`. This is easy to do in part because `dataframes` are themselves just collections of `series` that are aligned on the same `index` values. In other words, both `series` we worked with previously -- `atype` and `num_followers` -- have their own `indexes` when we work with them as `series`, but in a `dataframe`, they share an index. `dataframes` are organized the way we would expect: with variables in the columns and observations in the rows. We can use the `.head()` method to preview the top $n$ rows of the dataset. 

In [101]:
df.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,2260338140,POLITICS_T0DAY,https://t.co/9OgJ5RxUEV,United States,Russian,2/16/2016 23:15,2/16/2016 23:16,92,887,12939,...,Russian,0,NonEnglish,0,2260338140,699733931055259648,http://twitter.com/politics_t0day/statuses/699...,https://twitter.com/politics_t0day/status/6997...,,
1,2260338140,POLITICS_T0DAY,Пять этажей жилого дома рухнули в Ярославле по...,United States,Russian,2/16/2016 5:41,2/16/2016 5:42,92,884,12895,...,Russian,0,NonEnglish,0,2260338140,699468718888390656,http://twitter.com/politics_t0day/statuses/699...,https://youtu.be/STxTIceQmsA,,
2,2260338140,POLITICS_T0DAY,Вербовщика Джихади Джона нашли в Турции через ...,United States,Russian,2/16/2016 6:10,2/16/2016 6:10,92,884,12896,...,Russian,0,NonEnglish,0,2260338140,699476018063659008,http://twitter.com/politics_t0day/statuses/699...,https://youtu.be/xy5ap3xX_fs,,
3,2260338140,POLITICS_T0DAY,"""Война"" с Евгением Поддубным от 14.02.16 https...",United States,Russian,2/16/2016 6:36,2/16/2016 6:36,92,885,12897,...,Russian,0,NonEnglish,0,2260338140,699482457377210372,http://twitter.com/politics_t0day/statuses/699...,https://youtu.be/a7x4v7CYHiA,,
4,2260338140,POLITICS_T0DAY,Посол #САР в #РФ обвинил #США в авиаударах по ...,United States,Russian,2/16/2016 7:01,2/16/2016 7:01,92,885,12898,...,Russian,0,NonEnglish,0,2260338140,699488793993330688,http://twitter.com/politics_t0day/statuses/699...,https://twitter.com/politics_t0day/status/6994...,https://vk.com/wall-62675857_176796,
5,2260338140,POLITICS_T0DAY,В 2016 году начинается масштабная #модернизаци...,United States,Russian,2/16/2016 7:02,2/16/2016 7:02,92,885,12899,...,Russian,0,NonEnglish,0,2260338140,699488942958239744,http://twitter.com/politics_t0day/statuses/699...,https://twitter.com/politics_t0day/status/6994...,https://vk.com/wall-62675857_176809,
6,2260338140,POLITICS_T0DAY,Предательство #СССР. #Перестройка #Хрущёв'а. ...,United States,Russian,2/16/2016 7:02,2/16/2016 7:02,92,885,12900,...,Russian,0,NonEnglish,0,2260338140,699489089805012992,http://twitter.com/politics_t0day/statuses/699...,https://twitter.com/politics_t0day/status/6994...,https://vk.com/wall-62675857_176845,
7,2260338140,POLITICS_T0DAY,"Телеканал ""#Россия"" покажет #фильм-расследован...",United States,Russian,2/16/2016 7:03,2/16/2016 7:03,92,885,12901,...,Russian,0,NonEnglish,0,2260338140,699489262358700032,http://twitter.com/politics_t0day/statuses/699...,https://twitter.com/politics_t0day/status/6994...,https://vk.com/wall-62675857_176863,
8,2260338140,POLITICS_T0DAY,#Поклонская вручила руководству #меджлис'а пре...,United States,Russian,2/16/2016 7:04,2/16/2016 7:04,92,885,12903,...,Russian,0,NonEnglish,0,2260338140,699489549815324674,http://twitter.com/politics_t0day/statuses/699...,https://twitter.com/politics_t0day/status/6994...,https://vk.com/wall-62675857_176892,
9,2260338140,POLITICS_T0DAY,Обиженная #Турция может спровоцировать третью ...,United States,Russian,2/16/2016 7:04,2/16/2016 7:04,92,885,12902,...,Russian,0,NonEnglish,0,2260338140,699489393501933568,http://twitter.com/politics_t0day/statuses/699...,https://twitter.com/politics_t0day/status/6994...,https://vk.com/wall-62675857_176880,


Alternatively, we could use the `.sample()` method to pull a random sample of $n$ observations, which can be helpful if we don't want the observations we preview to be from the top (`head`) or bottom (`tail`) of the dataset.

In [102]:
df.sample(5)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
66787,1877492857,CHESPLAYSCHESS,Rowan County Oath Keepers Support And Defend...,United States,English,6/6/2015 14:51,6/6/2015 14:51,74,151,4196,...,Right,1,RightTroll,0,1877492857,607198135643836416,http://twitter.com/ChesPlaysChess/statuses/607...,http://ln.is/www.youtube.com/l5cMi,,
201134,1655815544,JEANUTTELLA,Gang Starr - Full Clip One Of The Best Yet! ...,United States,English,12/27/2016 0:59,12/27/2016 0:59,936,988,3183,...,Left,1,LeftTroll,0,1655815544,813549838314962944,http://twitter.com/1655815544/statuses/8135498...,http://fb.me/8kwRow2an,,
222163,2624554209,DAILYLOSANGELES,Southwest flight from Austin to Chicago redire...,United States,English,7/12/2017 19:46,7/12/2017 19:46,9073,19054,17857,...,local,0,NewsFeed,0,2624554209,885223814556250112,http://twitter.com/2624554209/statuses/8852238...,https://twitter.com/DailyLosAngeles/status/885...,http://www.foxla.com/news/local-news/267401940...,
118000,1691000604,ROBERTEBONYKING,'@Travistritt This was no usual election. Some...,United States,English,1/10/2017 7:15,1/10/2017 7:15,706,725,2632,...,Left,1,LeftTroll,0,1691000604,818717913293189121,http://twitter.com/1691000604/statuses/8187179...,,,
46179,3272640600,EXQUOTE,http://t.co/RSxVPUCnvV Ab workout bout killed ...,United States,English,8/1/2015 4:39,8/1/2015 4:39,2,350,27598,...,Commercial,0,Commercial,1,3272640600,627337804209635328,http://twitter.com/ExQuote/statuses/6273378042...,https://twitter.com/safety/unsafe_link_warning...,,


When working with a `dataframe`, we can select subsets of data by selecting columns or filtering rows. Let's look at selecting columns first. 

### Selecting Columns 

Earlier, we saw how we could select a single column by specifying the name of the `dataframe` followed by the name of the `series` inside square brackets and straight quotes. 

In [118]:
followers = df['followers']
followers.sample(10)

9898        875
242562    16758
137842      150
56819       109
199628      323
246943      923
8627        147
82523       494
119298     7065
40797       220
Name: followers, dtype: int64

We can select multiple columns by passing a list of column names. Whereas the result of the previous selection was a `series` (because we only pulled one column), selecting multiple columns will return a `dataframe` containing only the requested columns. 

In [117]:
ff = df[['followers', 'following']]
ff.sample(10)

Unnamed: 0,followers,following
48542,817,1543
240694,61,289
192372,26686,4994
81849,19,40
179268,775,311
185875,464,321
215622,13279,14363
241059,15662,10917
118331,721,708
154695,1087,1198


This kind of subsetting can be very helpful when, for example, you are working with datasets that have a lot of columns, only some of which are required for your analysis. 
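
The difference between the two selection styles is easy to check on a small table. The sketch below uses a made-up `toy` dataframe (an assumption for illustration, not the tweet data) to show which selections return a `series` and which return a `dataframe`.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet dataframe
toy = pd.DataFrame({
    'author': ['A', 'B', 'C'],
    'followers': [10, 250, 3000],
    'following': [5, 300, 120],
})

one_col = toy['followers']                   # single name -> Series
two_cols = toy[['followers', 'following']]   # list of names -> DataFrame
one_col_df = toy[['followers']]              # one-item list -> still a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(type(one_col_df).__name__)
```

Note the double square brackets in the last two selections: the outer pair does the selecting, and the inner pair is an ordinary Python list of column names.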

### Filtering Rows 

It is also sometimes necessary to filter rows. There are a variety of ways to do this, including slices (e.g. all observations between index $i$ and index $j$). In a data analysis context, most of the row filtering you will do is likely to be based on some sort of explicit condition, such as "give me all the observations with at least 1,000 followers." Most likely, you will only filter rows based on slices if you are selecting the first $n$ rows of a `dataframe` that has been sorted by the values of some `series`. We will consider this case later. 

In [114]:
df[df['followers'] >= 1000].sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
164405,3969530725,PATRIOTOTUS,America is facing a financial crisis once agai...,United States,English,12/30/2015 23:44,12/30/2015 23:45,4837,3869,646,...,Right,0,RightTroll,0,3969530725,682346660027088896,http://twitter.com/patriototus/statuses/682346...,https://twitter.com/patriototus/status/6823466...,,
187890,789000000000000000,WORLDNEWSPOLI,Accelerationism: how a fringe philosophy predi...,United States,English,5/11/2017 5:57,5/11/2017 5:57,4264,3014,28183,...,Right,0,RightTroll,0,789266125485998080,862547107558825984,http://twitter.com/789266125485998080/statuses...,https://twitter.com/WorldnewsPoli/status/86254...,https://www.theguardian.com/world/2017/may/11/...,
45866,1671234620,HYDDROX,"Heb 13:15 ""Therefore, let us offer through Jes...",United States,English,10/21/2016 22:56,10/21/2016 22:56,2159,2222,10169,...,Right,1,RightTroll,0,1671234620,789601285116981249,http://twitter.com/hyddrox/statuses/7896012851...,,,
74242,2951556370,SPECIALAFFAIR,Pakistan blasphemy killer's supporters clash w...,United States,English,3/27/2016 17:47,3/27/2016 17:47,10234,11253,22299,...,news,0,NewsFeed,0,2951556370,714146864078000128,http://twitter.com/SpecialAffair/statuses/7141...,,,
173537,743167000000000000,COVFEFENATIONUS,'@pick4guy @perfectsliders @rsirrobbie WE PRAY...,United States,English,11/26/2017 19:58,11/26/2017 19:58,247,2503,154554,...,Right,1,RightTroll,1,743166519157227520,934873979931652097,http://twitter.com/743166519157227520/statuses...,,,
49741,1671234620,HYDDROX,Proof of target practice with my new pistol. I...,United States,English,12/19/2016 21:25,12/19/2016 21:25,2593,2270,13319,...,Right,1,RightTroll,0,1671234620,810959271047397376,http://twitter.com/1671234620/statuses/8109592...,https://twitter.com/MansfieldWrites/status/810...,,
227881,870000000000000000,MARIALOPTRUMP,Putin jokes about offering political asylum to...,Unknown,English,6/15/2017 16:37,6/15/2017 16:37,2416,1153,1151,...,Right,0,RightTroll,0,869770503531180032,875391840505257984,http://twitter.com/869770503531180032/statuses...,https://twitter.com/EissyT56T/status/875391299...,https://twitter.com/safety/unsafe_link_warning...,
214177,3438999494,WORLDOFHASHTAGS,Dances with PoPO #policeamovie,United States,English,7/18/2016 13:38,7/18/2016 13:38,3863,4956,12924,...,Hashtager,1,HashtagGamer,0,3438999494,755033916898062336,http://twitter.com/WorldOfHashtags/statuses/75...,,,
86450,2570574680,RIAFANRU,Ирак: США не верят в быстрое освобождение стра...,Belarus,Russian,1/25/2017 1:45,1/25/2017 1:45,2610,11977,70897,...,Russian,0,NonEnglish,0,2570574680,824070601346260998,http://twitter.com/2570574680/statuses/8240706...,https://twitter.com/riafanru/status/8240706013...,https://riafan.ru/598865-irak-ssha-ne-veryat-v...,
30042,2601235821,TODAYPITTSBURGH,Teen killed in East Hills shooting #local,United States,English,5/18/2015 10:32,5/18/2015 10:32,5971,12149,8167,...,local,0,NewsFeed,0,2601235821,600247490231676928,http://twitter.com/TodayPittsburgh/statuses/60...,,,


Alternatively, we could filter based on membership in some category, such as being a `RightTroll` or `LeftTroll` account. `RightTroll` and `LeftTroll` are values of the `account_category` `series`. Let's get `RightTroll` accounts. 

In [115]:
df[df['account_category'] == 'RightTroll'].sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
162244,743167000000000000,COVFEFENATIONUS,Things Ted will put up with: 🍕 Serial rapist'...,United States,English,11/10/2017 17:01,11/10/2017 17:02,246,2085,142590,...,Right,1,RightTroll,1,743166519157227520,929031259346448384,http://twitter.com/743166519157227520/statuses...,https://twitter.com/tedlieu/status/92901614298...,,
101327,1679279490,AMELIEBALDWIN,'@BernieSanders @HillaryClinton Is this meant ...,United States,English,10/6/2016 22:13,10/6/2016 22:13,1944,2427,12700,...,Right,1,RightTroll,0,1679279490,784154552216027136,http://twitter.com/AmelieBaldwin/statuses/7841...,,,
188671,789000000000000000,WORLDNEWSPOLI,Kellyanne Conway blasts 'absurd' media claim t...,United States,English,5/16/2017 19:29,5/16/2017 19:29,4249,3022,29166,...,Right,0,RightTroll,0,789266125485998080,864563498155626496,http://twitter.com/789266125485998080/statuses...,https://twitter.com/WorldnewsPoli/status/86456...,http://www.washingtontimes.com/news/2017/may/1...,
171205,895000000000000000,ANNAMINGT,#minguu Rosie O’Donnell Denounces EVERYONE Who...,Unknown,English,8/16/2017 22:09,8/16/2017 22:09,22,1,949,...,Right,0,RightTroll,0,894837607061831681,897943414905257986,http://twitter.com/894837607061831681/statuses...,https://twitter.com/AnnamIngT/status/897943414...,http://ift.tt/2i6GuEr,
211708,1671936266,ARM_2_ALAN,This is a Fox News Alert: Wishing @MelissaAFra...,United States,English,6/10/2015 23:53,6/10/2015 23:53,68,167,7895,...,Right,1,RightTroll,0,1671936266,608784024358690817,http://twitter.com/Arm_2_Alan/statuses/6087840...,https://twitter.com/AndreaTantaros/status/6087...,,
164764,1833223908,DOROTHIEBELL,FLASHBACK: Black Supremacists Endorse Candidat...,United States,English,9/16/2016 18:15,9/16/2016 18:15,1846,1518,2048,...,Right,1,RightTroll,0,1833223908,776847060691775490,http://twitter.com/DorothieBell/statuses/77684...,http://bit.ly/2cmhO3t,,
67748,3084360275,MIL0BLAKE,14 Year-Old Belgian Muslim Videotapes Migrants...,United States,English,2/6/2016 14:25,2/6/2016 14:25,1028,414,1042,...,Right,1,RightTroll,0,3084360275,695976684646957056,http://twitter.com/Mil0Blake/statuses/69597668...,https://shar.es/14WVIH,,
63400,1661246144,KATERITTERRRR,WikiLeaks Daily Stream of 1000s of Podesta Ema...,United States,English,10/22/2016 15:22,10/22/2016 15:23,1320,1577,5339,...,Right,1,RightTroll,0,1661246144,789849450126671874,http://twitter.com/KateRitterrrr/statuses/7898...,https://twitter.com/TheTrumpLady/status/789834...,,
229526,871000000000000000,MARIATRUMPT,This U.S. State Will Offer “Not Specified” Gen...,Unknown,English,6/17/2017 23:04,6/17/2017 23:04,0,2,77,...,Right,0,RightTroll,0,870500630464299008,876213902748418048,http://twitter.com/870500630464299008/statuses...,https://twitter.com/MariaTRUMPt/status/8762139...,https://twitter.com/safety/unsafe_link_warning...,
61554,1877492857,CHESPLAYSCHESS,Union: Hackers have personnel data on every fe...,United States,English,6/11/2015 21:54,6/11/2015 21:54,74,168,8798,...,Right,1,RightTroll,0,1877492857,609116505653903360,http://twitter.com/ChesPlaysChess/statuses/609...,http://tiny.iavian.net/5iqp,,


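We are left with a subset of 711,668 observations (check yourself: `len(df[df['account_category'] == 'RightTroll'])`) from accounts that are classified as `RightTroll`. Each observation is a tweet, so this is a count of tweets rather than a count of accounts. 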
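
Conditions like the ones above can also be combined. The sketch below uses a made-up `toy` dataframe (an assumption for illustration) to show the boolean mask that a condition produces, how to join two conditions with `&`, and how `isin()` filters on membership in several categories at once.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet dataframe
toy = pd.DataFrame({
    'author': ['A', 'B', 'C', 'D'],
    'followers': [10, 2500, 3000, 900],
    'account_category': ['RightTroll', 'LeftTroll', 'NewsFeed', 'RightTroll'],
})

# A condition produces a boolean Series (a "mask") with one True/False per row
mask = toy['followers'] >= 1000

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
popular_left = toy[(toy['followers'] >= 1000) & (toy['account_category'] == 'LeftTroll')]

# isin() keeps rows whose value appears in a list of categories
trolls = toy[toy['account_category'].isin(['RightTroll', 'LeftTroll'])]

print(mask.tolist())
print(popular_left)
print(trolls)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>=` and `==` in Python.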

# Aggregation and Grouped Operations

Some of the most common tasks in any given data analysis project involve some sort of aggregation or grouped operation. For example, we might want to compute and compare descriptive statistics for observations that take different values on a categorical variable. Let's see how to do that, and other grouped operations, with `pandas`. 

In brief, the `groupby()` method splits the `dataframe` into groups based on the values of a given variable. We can then perform operations on the resulting groups, such as computing descriptive statistics. 

In [142]:
grouped = df.groupby('account_category')
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

The code above returns a grouped object that we can work with. Let's say we want to pull out a specific group. We can use the `get_group()` method to pull a group from the grouped object. (Note that the `.get_group()` code below is equivalent to `df[df['account_category'] == 'RightTroll']`.) 

In [147]:
right_troll_group = grouped.get_group('RightTroll')
right_troll_group.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
17628,2912754262,POLITWEECS,It's another beautiful but breezy day! http://...,United States,English,4/1/2015 15:30,4/1/2015 15:30,712,188,1,...,Right,1,RightTroll,0,2912754262,583290244151140352,http://twitter.com/Politweecs/statuses/5832902...,https://twitter.com/JesseHawilaKCTV/status/583...,,
17629,2912754262,POLITWEECS,First Michiana business to publicly deny same-...,United States,English,4/1/2015 15:33,4/1/2015 15:33,712,188,2,...,Right,0,RightTroll,0,2912754262,583291010551169025,http://twitter.com/Politweecs/statuses/5832910...,,,
17630,2912754262,POLITWEECS,Obama for Prezident in 2016 #BadPrankIn5Words ...,United States,Croatian,4/1/2015 16:15,4/1/2015 16:16,712,188,3,...,Right,0,RightTroll,0,2912754262,583301633494249474,http://twitter.com/Politweecs/statuses/5833016...,https://twitter.com/Politweecs/status/58330163...,,
17631,2912754262,POLITWEECS,The U.S. and #Cuba have held their HIGHest-lev...,United States,English,4/10/2015 10:00,4/10/2015 10:00,589,5543,62,...,Right,0,RightTroll,0,2912754262,586468665920360448,http://twitter.com/Politweecs/statuses/5864686...,https://twitter.com/Politweecs/status/58646866...,,
17632,2912754262,POLITWEECS,Fox is about to end #TheSimpsons seasonal DVD ...,United States,English,4/10/2015 10:07,4/10/2015 10:07,589,5543,63,...,Right,0,RightTroll,0,2912754262,586470580724957185,http://twitter.com/Politweecs/statuses/5864705...,,,
17633,2912754262,POLITWEECS,Police horse named Jacob with a prestigious jo...,United States,English,4/10/2015 11:24,4/10/2015 11:24,589,5543,64,...,Right,0,RightTroll,0,2912754262,586489845469483008,http://twitter.com/Politweecs/statuses/5864898...,https://twitter.com/Politweecs/status/58648984...,,
17634,2912754262,POLITWEECS,This is an amazing Sun Cruise Resort hotel of ...,United States,English,4/10/2015 12:03,4/10/2015 12:03,589,5543,65,...,Right,0,RightTroll,0,2912754262,586499757113118721,http://twitter.com/Politweecs/statuses/5864997...,https://twitter.com/Politweecs/status/58649975...,,
17635,2912754262,POLITWEECS,Being #fat in middle age reduces risk of devel...,United States,English,4/10/2015 13:39,4/10/2015 13:39,589,5541,66,...,Right,0,RightTroll,0,2912754262,586523818929164290,http://twitter.com/Politweecs/statuses/5865238...,https://twitter.com/Politweecs/status/58652381...,,
17636,2912754262,POLITWEECS,#Obama approval rating among #Jews has gone fr...,United States,English,4/10/2015 15:24,4/10/2015 15:24,589,5536,67,...,Right,0,RightTroll,0,2912754262,586550375068536832,http://twitter.com/Politweecs/statuses/5865503...,https://twitter.com/Politweecs/status/58655037...,,
17637,2912754262,POLITWEECS,You can buy about 2000 #pizza s instead of gol...,United States,English,4/10/2015 16:24,4/10/2015 16:24,589,5535,68,...,Right,0,RightTroll,0,2912754262,586565440979144704,http://twitter.com/Politweecs/statuses/5865654...,,,


As previously mentioned, sometimes we want to compute some value for a group within the dataset. We can do this by specifying the grouped object, the `series` we want to perform an operation on, and finally the operation we want to perform. 

In [153]:
grouped['followers'].median()

account_category
Commercial        273
Fearmonger         48
HashtagGamer     2480
LeftTroll         836
NewsFeed        14722
NonEnglish        503
RightTroll       1437
Unknown           205
Name: followers, dtype: int64

In [154]:
grouped['following'].median()

account_category
Commercial         3
Fearmonger        65
HashtagGamer    2613
LeftTroll        796
NewsFeed        7089
NonEnglish       434
RightTroll      1864
Unknown          567
Name: following, dtype: int64

There are many things you can do here, such as comparing the ratio of median followers to median following for each category. 

In [156]:
grouped['followers'].median() / grouped['following'].median()

account_category
Commercial      91.000000
Fearmonger       0.738462
HashtagGamer     0.949101
LeftTroll        1.050251
NewsFeed         2.076739
NonEnglish       1.158986
RightTroll       0.770923
Unknown          0.361552
dtype: float64

We can also perform some operations on the grouped object itself, such as computing the number of observations in each group, which in this case is equal to the number of tweets sent by accounts in each category. 

In [161]:
grouped.size().sort_values(ascending=False)

account_category
NonEnglish      820803
RightTroll      711668
NewsFeed        598226
LeftTroll       427141
HashtagGamer    241786
Commercial      121904
Unknown          13539
Fearmonger       11140
dtype: int64

It is also possible to group by multiple variables, such as `account_category` and `language`, and then perform an operation on the groups, such as computing the median number of followers. 

In [175]:
cat_lang = df.groupby(['account_category', 'language'], as_index=False)['followers'].median()
cat_lang.sample(30)

Unnamed: 0,account_category,language,followers
280,RightTroll,German,1833.0
206,NewsFeed,Russian,14209.0
267,RightTroll,Arabic,1324.5
134,LeftTroll,Croatian,821.5
213,NewsFeed,Turkish,19685.0
287,RightTroll,Indonesian,149.5
160,LeftTroll,Polish,876.0
73,Fearmonger,Norwegian,48.5
74,Fearmonger,Polish,53.0
162,LeftTroll,Pushto,92.0


Depending on what you are doing, the result of a grouped analysis like this could be a `series` or a `dataframe`. 
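
As a minimal sketch of that difference, the example below runs the same grouped median twice on a made-up `toy` dataframe (an assumption for illustration): once with the default settings and once with `as_index=False`, as in the cell above.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet dataframe
toy = pd.DataFrame({
    'account_category': ['RightTroll', 'RightTroll', 'LeftTroll', 'LeftTroll'],
    'language': ['English', 'German', 'English', 'English'],
    'followers': [100, 300, 50, 150],
})

# Default: the group keys become a MultiIndex and the result is a Series
as_series = toy.groupby(['account_category', 'language'])['followers'].median()

# as_index=False: the group keys stay as ordinary columns and the result is a DataFrame
as_frame = toy.groupby(['account_category', 'language'], as_index=False)['followers'].median()

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame)
```

The `dataframe` form is often more convenient when the result will be sorted, filtered, or plotted afterwards.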

Finally, we can perform *multiple* operations on a grouped object by using the `agg()` method. 

* TODO... 
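
Until this section is filled in, here is a minimal sketch of what `agg()` can do, using a made-up `toy` dataframe (an assumption for illustration). It shows two common patterns: passing a list of statistics for one column, and "named aggregation", where each output column is given a name of our choosing. The output column names `median_followers` and `max_following` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet dataframe
toy = pd.DataFrame({
    'account_category': ['RightTroll', 'RightTroll', 'LeftTroll', 'LeftTroll', 'LeftTroll'],
    'followers': [100, 300, 50, 150, 250],
    'following': [80, 120, 40, 60, 500],
})
grouped_toy = toy.groupby('account_category')

# Several statistics for one column: pass a list of function names
followers_stats = grouped_toy['followers'].agg(['median', 'mean', 'max'])

# Named aggregation: output_name=(input_column, function)
summary = grouped_toy.agg(
    median_followers=('followers', 'median'),
    max_following=('following', 'max'),
)

print(followers_stats)
print(summary)
```

Both calls return a `dataframe` with one row per group and one column per requested statistic.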

## Sorting and Ranking

Sorting and ranking observations based on some criteria is a common data analysis task. For example, we might want to know which accounts in our dataset have the most followers. 

> **Jillian**, I am still working on this. 

> TODO: cover tie breaking methods for sorts and ranks 

In [176]:
cat_lang.sort_values('followers', ascending = False)[:10]

Unnamed: 0,account_category,language,followers
242,NonEnglish,LANGUAGE UNDEFINED,26395.0
178,NewsFeed,Arabic,20700.0
213,NewsFeed,Turkish,19685.0
209,NewsFeed,Somali,19622.0
188,NewsFeed,Finnish,19035.5
292,RightTroll,LANGUAGE UNDEFINED,17448.0
215,NewsFeed,Uzbek,17026.0
216,NewsFeed,Vietnamese,16555.0
177,NewsFeed,Albanian,16110.0
183,NewsFeed,Danish,15565.0


In [181]:
# cat_lang['followers'].rank(method='max')
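
Tie breaking is easiest to see on a tiny example. The sketch below uses made-up values (an assumption for illustration) in which two observations are tied for the largest value, and compares four of the `method` options that `rank()` accepts. It also shows one way to break ties when sorting: sort on a second column.

```python
import pandas as pd

# Hypothetical values; 'a' and 'c' are tied for the largest value
scores = pd.Series([300, 100, 300, 50], index=['a', 'b', 'c', 'd'])

avg_rank = scores.rank(ascending=False)                    # default: ties get the average rank
min_rank = scores.rank(ascending=False, method='min')      # ties share the lowest rank
first_rank = scores.rank(ascending=False, method='first')  # ties broken by order of appearance
dense_rank = scores.rank(ascending=False, method='dense')  # like 'min', but no gaps after ties

print(pd.DataFrame({'average': avg_rank, 'min': min_rank,
                    'first': first_rank, 'dense': dense_rank}))

# When sorting, ties on the first column can be broken by a second column
toy = pd.DataFrame({'author': ['c', 'a', 'b'], 'followers': [300, 300, 100]})
sorted_toy = toy.sort_values(['followers', 'author'], ascending=[False, True])
print(sorted_toy)
```

Which method is appropriate depends on the question being asked, so it is worth stating the choice explicitly rather than relying on the default.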

## Correlation and Covariance 

* Correlation matrix 

In [148]:
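# Restrict to numeric columns first; recent pandas versions raise an error on text columns
df.select_dtypes('number').corr()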

Unnamed: 0,following,followers,updates,retweet,new_june_2018,tweet_id
following,1.0,0.580259,0.15195,-0.305094,-0.150726,0.110589
followers,0.580259,1.0,0.233705,-0.312036,-0.049159,0.086571
updates,0.15195,0.233705,1.0,-0.17192,0.119216,0.14943
retweet,-0.305094,-0.312036,-0.17192,1.0,0.116437,-0.027388
new_june_2018,-0.150726,-0.049159,0.119216,0.116437,1.0,-0.353891
tweet_id,0.110589,0.086571,0.14943,-0.027388,-0.353891,1.0


In [150]:
df['followers'].corr(df['following'])

0.5802587806116529

* covariance 
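
Covariance works much like correlation. The sketch below uses a made-up `toy` dataframe (an assumption for illustration) to compute a covariance matrix with `cov()`, the covariance between two `series`, and then recovers the correlation by dividing the covariance by the product of the two standard deviations.

```python
import pandas as pd

# Hypothetical toy data; 'following' is exactly twice 'followers'
toy = pd.DataFrame({
    'followers': [1, 2, 3, 4],
    'following': [2, 4, 6, 8],
})

cov_matrix = toy.cov()                              # pairwise covariance of numeric columns
cov_pair = toy['followers'].cov(toy['following'])   # covariance of two Series

# Correlation is covariance rescaled by the two standard deviations
corr_from_cov = cov_pair / (toy['followers'].std() * toy['following'].std())

print(cov_matrix)
print(cov_pair)
print(corr_from_cov)
```

Because covariance depends on the units of the variables, the rescaled correlation is usually easier to compare across pairs of variables.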

# Dates and Times

> **Jillian** -- would be great to introduce this here if you have the bandwidth. Or I can later. I have started some time series plots in the next notebook and will be adding more tomorrow. If not, no big deal. 
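
As a starting point, here is a minimal sketch of how date strings can be converted into proper datetime values. The three `publish_date` strings are copied from sample rows shown earlier; the `toy` dataframe and the format string `'%m/%d/%Y %H:%M'` are assumptions based on the month/day/year pattern visible in those samples.

```python
import pandas as pd

# Hypothetical toy data; the date strings mimic the publish_date column
toy = pd.DataFrame({
    'publish_date': ['6/6/2015 14:51', '12/27/2016 0:59', '7/12/2017 19:46'],
    'followers': [151, 988, 19054],
})

# Parse the strings into datetime values (format string is an assumption)
toy['publish_date'] = pd.to_datetime(toy['publish_date'], format='%m/%d/%Y %H:%M')

# The .dt accessor exposes date components as new Series
toy['year'] = toy['publish_date'].dt.year
toy['month'] = toy['publish_date'].dt.month

# Once the components exist, they can be used like any other grouping variable
tweets_per_year = toy.groupby('year').size()

print(toy.dtypes)
print(tweets_per_year)
```

Once a column holds datetime values, it can be sorted chronologically and used in grouped operations like the ones covered above.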

## Selecting, Aggregating, Summarizing, and Subsetting <a id='sass'></a>

In `Pandas`, each individual column / variable is called a `Series`. To select a `Series` from our `dataframe`, we simply have to type the `Series` name inside square brackets following the name of the `dataframe` object itself. For example, to get the `Series` containing account types, we would type:

In [None]:
df['account_type']

We can perform simple operations on these individual `Series`, such as counting the number of observations per account type in the dataset. 

In [None]:
obs_by_type = df['account_type'].value_counts()
obs_by_type.sort_values(ascending = False)

We can create smaller `dataframes` by passing in a list of the `Series` we want to include, or by subsetting the observations based on some criteria. In the first case, we produce a smaller `dataframe` by selecting specific variables. In the second case we produce a smaller `dataframe` by selecting specific observations. For example, if we wanted to select the variables for following, followers, and account type, we could do the following: 

In [None]:
small = df[['following', 'followers', 'account_type']]
small.sort_values(['following'], ascending = False).head(40)

The second way we might want to subset our data is by selecting some subset of observations. For example, if we wanted to pull out a subset of our data for right troll accounts, we could do the following:

In [None]:
right_trolls = df[df['account_type'] == 'Right']
right_trolls.sample(10)

`Pandas` makes it easy to perform fundamental data operations, such as grouping by one variable and computing a summary statistic for another. Let's say, for example, that we wanted to group our dataset by account type and then get the median number of followers. (Note that I am not suggesting you should do this in a real data analysis, I am simply demonstrating `Pandas` functionality.) 

In [None]:
median_followers = df.groupby('account_type')['followers'].median().round(0)
median_followers.sort_values(ascending = False)[:35]

# Best practices for Pandas <a id='pandasbp'></a>

From [pandas](https://pandas.pydata.org/):
>pandas is a **fast**, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 

This is all true, with a pretty large caveat. Pandas is fast (and generally efficient) if you avoid some of the common pitfalls. Unfortunately, these traps are easy to fall into and many pandas users (even senior data scientists) don't know they might be slowing their code down 10-1000x. These people will often be hesitant to use pandas on large datasets and may dissuade others from using the library. 

However, by understanding a little about what is going on in the backend, we can avoid the worst of the problems and write relatively fast pandas code. 

[How is data stored in pandas?](#storage)   
[Efficient Transformation](#transform)   
[Efficient Initialization](#init)   

## How is data stored in Pandas? <a id='storage'></a>

<h3><font color='tomato'>Series</font></h3>

### DataFrames
DataFrames are really just a collection of Series, with each column corresponding to its own Series. Each item in a Series (or column) is stored right after the one before it, so the entire column is stored within a single range of memory.

However, the multiple Series (columns) that make a DataFrame can be stored anywhere in memory and are often not stored side-by-side. 

We can think of this like a grocery list for sandwiches. Let's imagine that each kind of sandwich we make is composed of 1 type of bread, 1 type of meat, and 1 type of vegetable. We could arrange our grocery list into a table like this: 

| sandwich_id | bread_type | meat_type  | vegetable_type |
|-------------|------------|------------|----------------|
| 0           | sourdough  | ham        | lettuce        |
| 1           | baguette   | turkey     | tomato         |
| 2           | rye        | roast beef | onion          |

We buy all of our bread products from a bakery, meat from a deli, and vegetables from a grocer. The result is that to get everything in a column, you can go to one location (e.g. bakery for bread_type). But to get everything from a row you will have to visit all three locations. 

This means that it is really fast to access an entire column, but really slow to access a row. Let's check it out!

>**Aside**   
>In the code below, the `%timeit` line is called a [magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#). The `timeit` magic lets us time the execution of a Python statement. 

In [None]:
# Setup code -- ideally this is changed to a dataset we are using 
import numpy as np
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.sample(5)

In [None]:
print("Column\n------")
%timeit sl = iris['petal_length']

print("Row\n------")
%timeit example_1 = iris.iloc[12]

Accessing the column is more than 50x faster than accessing the row! 
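
One way to see part of the extra work involved in row access is to compare data types. The sketch below uses a made-up version of the sandwich table (the `slices` and `price` columns are hypothetical additions) to show that a column keeps its own dtype, while a row has to be assembled from every column into a new Series that can hold mixed values.

```python
import pandas as pd

# Hypothetical toy data based on the sandwich example
toy = pd.DataFrame({
    'bread_type': ['sourdough', 'baguette', 'rye'],
    'slices': [2, 2, 3],
    'price': [4.5, 3.0, 5.25],
})

column = toy['slices']   # one column: all values share a single dtype
row = toy.iloc[0]        # one row: values gathered from every column

print(column.dtype)  # an integer dtype
print(row.dtype)     # object, because the row mixes text and numbers
```

The generic `object` dtype is a sign that pandas had to visit each column and repackage the values, which is the trip to the bakery, the deli, and the grocer.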

## Transformations <a id='transform'></a>
One place where the underlying storage structure has consequences for speed is when applying transformations (calculations or other functions) to a DataFrame. 

A common case of this is when we want to add a new column to our DataFrame based on values in other columns. For example, we may want to:  
* Extract month from a date column
* Calculate area from width & length columns
* Predict whether a flight will be late by applying a deep learning model to the values of 5 other columns. 

There is a long list of transformations we might be interested in, many of which operate on a single row, independent of other rows. 

There are many ways to implement transformations in pandas, some of which take advantage of how DataFrames are stored and others that do not. Below, we are going to look at 6 methods for implementing transformations:
1. [For Loops](#for)
2. [`iterrows`](#iter)
3. [Apply Method](#apply)
4. [Zip & Iterate](#zip)
5. [Vectorized Functions](#vec) 
6. [NumPy Vectorized Functions](#np)

We will use a common example across transformation methods allowing us to compare the speed of each one. For each method we will create a new column called `petal_area` by multiplying `petal_length` by `petal_width`.

### 1 For Loops <a id='for'></a>
Perhaps the most obvious way to create this new column is to go row-by-row through the DataFrame using a [for loop](01_introduction.ipynb#conditional), doing the necessary transformations one row at a time.

For each row in the DataFrame we will calculate the area by multiplying `petal_length` by `petal_width` and append the result to a list that could eventually be added as a column to the DataFrame.

In [None]:
%%timeit
# Looping over the rows
area_column = []
for i in range(0, len(iris)):
    row = iris.loc[i]
    row_area = row['petal_length'] * row['petal_width']
    area_column.append(row_area)

While we haven't checked any other methods yet, I'll let you know that this is _really_ slow. If we think about how DataFrames are stored it becomes clear why this is so slow. 

Before I get into this, let's look at the second method. 



### 2  `iterrows()` <a id='iter'></a>
A second method we can use to add our new column is the `iterrows()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows)). This is a built-in method pandas has implemented to iterate over the rows in a DataFrame. 

This method creates a [generator object](https://wiki.python.org/moin/Generators), a special Python object, which we can use a for loop to iterate over. 

In [None]:
%%timeit
area_column = []
for idx, row in iris.iterrows():
    row_area = row['petal_length'] * row['petal_width']
    area_column.append(row_area)

While this was definitely faster than the basic [for loop](#for) approach, it's still really slow. In fact, the underlying reason why these two approaches are so slow is the same.

Both approaches use a `for` loop to go row-by-row through the DataFrame, gathering the data for each row as it's needed. 

In our sandwich example, this is the equivalent of buying ingredients for sandwich 1, then buying ingredients for sandwich 2, etc. This results in visiting each shop (bakery, deli, grocer) once for every sandwich recipe!

The same thing is happening in pandas. To iterate over the rows using a for loop we retrieve all values for row 1, then all values for row 2, etc. 

This is incredibly inefficient (imagine the funny looks you'd get on your 3rd visit to the bakery)! In fact, I would venture to say that **you should never use `for` loops to go row-by-row through a pandas DataFrame**. There might be cases when I'm wrong, but there is almost always a better approach. 

### 3 Apply Method <a id=apply></a>
A third approach we can use is the `apply()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply)). This built-in pandas method applies a specific function across some axis (rows or columns). In our case, we want to apply a function along the column axis, applying the function to each row. 

To use `apply()`, you have to define the function you want to apply. This function needs to take in a row, do its calculation, and return some value. For our case, we'll define an `area()` function. 

In [None]:
%%timeit
def area(row):
    return row['petal_length'] * row['petal_width']

area_column = iris.apply(area, axis=1)

Thankfully this is faster than the previous two approaches. But we are still in the realm of milliseconds. This method is still relatively slow because it continues to go row-by-row through the DataFrame. 

It's faster because `apply()` is used for a specific purpose, so pandas is able to make assumptions and include optimizations that the more general approaches don't have access to. For example, `apply()` checks to see if your function is compatible with its "fast" mode ([docs](https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/frame.py#L6737-L6928)). As well, it offloads some of the work to C (a low-level language known for speed), only performing the functions themselves in Python. 

I almost always avoid using `apply()`, although it does make for readable code.

### 4 Zip & Iterate <a id=zip></a>
A fourth method is to use the built-in `zip()` function available in Python ([docs](https://docs.python.org/3.3/library/functions.html#zip)). This function takes in a group of iterables (lists, dictionaries, tuples, etc) and creates a new iterator where the i-th element is a tuple containing the i-th elements from each of the original iterables. 

For example:   
```
>>> l1 = [1, 2, 3]
>>> l2 = ['a', 'b', 'c']
>>> z = zip(l1, l2)
>>> list(z)
    [(1, 'a'), (2, 'b'), (3, 'c')]
```

In general this is quite a useful function that can be used for lots of different purposes. In our case we will:
1. Determine which columns are needed for the transformation
2. Zip these columns together
3. Iterate over the zipped object to retrieve pairs one at a time, applying some function to the pairs and storing the result in a list which will later become our new column

In [None]:
%%timeit
area_column = []
for l, w in zip(iris['petal_length'], iris['petal_width']):  # l = length, w = width
    area = l * w
    area_column.append(area)

As you can see, this method offers a great improvement over our last method (~120x faster). The reason for this large improvement is that this is the first method that avoids going row-by-row through the DataFrame. 

Instead of performing many (#rows x #columns) costly read operations, this method reads each column only once. The resulting data is stored temporarily in fast memory, where it can be accessed at little cost when it is needed for calculations. 

This is the method I typically use for calculations. It offers a good balance of efficiency, readability, and flexibility. 

### 5 Use Vectorized Functions <a id=vec></a>
Depending on the transformation we are undertaking, we might be able to use a vectorized function (aka vector function). These functions take in and operate on entire pandas Series, rather than on individual values. 

There are many built-in vectorized operations, such as `*` (shown below), `add()`, `between()`, and `shift()`. You can also build your own vectorized function as a combination of these built-in methods.  

In [None]:
%%timeit
# Straight up vector calculations
iris['petal_area'] = iris['petal_length'] * iris['petal_width']

Similar to option 4, this method is significantly faster than the first three approaches. Once again, this is because we are avoiding accessing rows one-by-one. 

In addition, vectorized functions are able to further optimize by making use of pre-compiled code written in a lower-level (and faster) language like C. 

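Honestly, I'm unsure why this method appears to be slower than our 4th option, the zip method. One plausible factor is that the dataset is tiny (150 rows) and the timed cell also assigns the result to a new column, so fixed overhead may dominate. I suspect this will not always be the case, especially when datasets grow and functions become more complex. 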
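
To illustrate building a vectorized function out of built-in pieces, here is a minimal sketch. The function name `petal_area_in_range`, its `low` and `high` parameters, and the `toy` data are all hypothetical. It combines `*`, `between()`, and `where()`, each of which operates on whole Series at once.

```python
import pandas as pd

def petal_area_in_range(length, width, low=1.0, high=5.0):
    """Return petal area where length lies in [low, high], and NaN elsewhere."""
    area = length * width                  # element-wise multiplication
    in_range = length.between(low, high)   # element-wise boolean test (inclusive)
    return area.where(in_range)            # keep area where True, NaN where False

# Hypothetical toy data with iris-like column names
toy = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width': [0.2, 1.4, 2.5],
})

result = petal_area_in_range(toy['petal_length'], toy['petal_width'])
print(result)
```

No explicit loop appears anywhere in the function, so every step can use the same fast column-wise machinery as the single multiplication above.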

### 6 Use NumPy Vectorized Functions<a id=np></a>
For an improvement over method 5, we take one extra step and convert our pandas Series into NumPy arrays and apply the same vectorized functions to obtain our transformation. 

In [None]:
%%timeit
areas = np.array(iris['petal_length']) * np.array(iris['petal_width'])

As with the last two options, this approach avoids costly row-by-row reads. Like option 5, this method also uses pre-compiled code to achieve further optimization. 

In addition, by converting the pandas Series to NumPy arrays, this method removes the overhead incurred by pandas' additional functionality. 
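
The same conversion can also be written with the `to_numpy()` method, which the pandas documentation recommends for getting the underlying array. The sketch below uses a made-up `toy` dataframe (an assumption for illustration) and also shows the result being assigned back as a new column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with iris-like column names
toy = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width': [0.2, 1.4, 2.5],
})

length = toy['petal_length'].to_numpy()
width = toy['petal_width'].to_numpy()
areas = length * width   # a plain NumPy array with no index attached

# Assigning an array back to the DataFrame matches values by position
toy['petal_area'] = areas
print(toy)
```

One trade-off to keep in mind: a NumPy array has no index, so values are matched purely by position. That is fine here because both arrays come from the same DataFrame in the same row order.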

### and Beyond <a id='beyond'></a>
For the cases when even these options aren't fast enough, you can implement more advanced techniques to enhance performance. The improvements these advanced techniques can offer differ based on the problem at hand. For example, some techniques use functions and methods that are optimized for boolean comparisons (e.g. greater than) but offer little improvement when working with other functions like addition. 

Some other approaches to check out include: 
* Using [NumExpr](https://pypi.org/project/numexpr/2.6.1/) for extra fast numerical expressions
* Rewriting functions in [Cython](https://cython.org/)
* Using [Numba](https://numba.pydata.org/) to convert Python code to fast machine code. 

## Takeaways

While this difference in speed is hard (if not impossible) to notice for small datasets, it can become hugely consequential when working with large datasets or performing complex calculations. 

We have to remember that optimizing code should not come at the expense of functionality. Often it's best to get something that works before going back and finding the most optimal solution. However, I hope that by introducing a couple of "Do's & Don'ts" your first instincts can help you avoid some of the easiest traps.


1. Never directly iterate over the rows in a DataFrame. Avoid anything that goes row-by-row.  
2. Working with NumPy arrays will generally be faster than working with pandas Series.
3. DataFrame data is stored based on columns, not rows. This means it's much faster to access a column than a row. 
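
To check these takeaways on your own machine, the sketch below times four of the approaches side by side using Python's built-in `timeit` module instead of the `%timeit` magic. The synthetic `toy` data and the helper function names are assumptions for illustration, and the timings it prints will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption; any two numeric columns would do)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'petal_length': rng.uniform(1, 7, size=1_000),
    'petal_width': rng.uniform(0.1, 2.5, size=1_000),
})

def with_iterrows(df):
    return [row['petal_length'] * row['petal_width'] for _, row in df.iterrows()]

def with_zip(df):
    return [l * w for l, w in zip(df['petal_length'], df['petal_width'])]

def with_vectorized(df):
    return df['petal_length'] * df['petal_width']

def with_numpy(df):
    return df['petal_length'].to_numpy() * df['petal_width'].to_numpy()

methods = [with_iterrows, with_zip, with_vectorized, with_numpy]

# Every approach should produce the same numbers
results = {f.__name__: np.asarray(f(toy)) for f in methods}

# Average time over 3 runs for each approach
timings = {f.__name__: timeit.timeit(lambda f=f: f(toy), number=3) / 3 for f in methods}
for name, seconds in timings.items():
    print(f"{name:16s} {seconds * 1e3:8.3f} ms per run")
```

Try increasing `size` to see how the gap between the row-by-row approach and the column-wise approaches changes as the dataset grows.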


# Open Work Time <a id='open'></a>