# FUNDAMENTALS OF DATA ANALYSIS WITH PYTHON <br><font color="crimson">DAY 3: SOCIAL SCIENTIFIC COMPUTING WITH PANDAS</font>

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6, 2020

### Course Developers and Instructors 

* Dr. [John McLevey](https://www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a?challengeId=AQGaFXECVnyVqAAAAW_TLnwJ9VHAlBfinArnfKV6DqlEBpTIolp6O2Bau4MmjzZNgXlHqEIpS5piD4nNjEy0wsqNo-aZGkj57A&submissionId=16582ced-1f90-ec15-cddf-eb876f4fe004), Simon Fraser University (jillianderson8@gmail.com) 

<hr>

### Overview 

This notebook introduces some fundamentals of scientific computing with `Pandas` and `matplotlib`. `Pandas` is an extremely popular Python package for storing, manipulating, and analyzing data in a tabular form, with rows and columns. We will learn how to get data into `pandas`, and then how to perform common data analysis tasks such as selecting columns, filtering rows, and computing descriptive statistics. Then we will learn how to use `matplotlib` for producing high-quality plots for print or the web. We will use it to create a variety of common statistical plots and other visualizations. 

### Plan for the Day

1. [`Pandas` 101](#pandas)
2. [Best practices for `Pandas`](#pandasbp)
3. [Open Work Time](#open)

<hr>

In [2]:
import os
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'svg' # better resolution with vector graphics! 

# `Pandas` 101<a id='pandas'></a>

Quantitative or computational social scientists are used to working with data in tabular form, such as a `dataframe` with variables in the columns and observations in the rows. In Python, the `Pandas` package enables us to organize, manipulate, and analyze data in this familiar way. 

`Pandas` is an extremely popular package in the scientific computing community regardless of the discipline (physics, sociology, neuroscience, history) or sector (academia, government, industry). It was originally developed for time series analysis. It gets its name from **pan**el **da**ta. 

This part of the notebook covers some essential functionality of `Pandas` that you will make heavy use of in most data analyses. Of course, we will not cover *everything* that is possible to do with `Pandas`. As with the previous content, the goal is to build a basic foundation that we can build on throughout the week. We will emphasize the functionality that can take you the furthest in any given data analysis project. 

## Reading Data from Files 

`Pandas` makes it easy to load data from an external file directly into a `DataFrame`, which we will discuss momentarily. It does so using one of many `reader` functions that are part of a suite of `I/O` (input/output, read/write) tools. The table below lists some common examples. The `pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) covers these and other `reader` functions, along with their parameters, such as how to specify which sheet you want from an Excel spreadsheet, or whether to write the index to a new `csv` file. 



| Data Description                | Reader          | Writer        |
|:--------------------------------|:----------------|:--------------|
| CSV                             | `read_csv()`   | `to_csv()`   |
| JSON                            | `read_json()`  | `to_json()`  |
| MS Excel and OpenDocument (ODF) | `read_excel()` | `to_excel()` |
| Stata                           | `read_stata()` | `to_stata()` |
| SAS                             | `read_sas()`   | NA            |
| SPSS                            | `read_spss()`  | NA            |


To illustrate how these `reader` functions work, we will use the `read_csv()` function. The only *required* argument is the path to the file on our computer. 

In this case, we will use the ["Three Million Russian Trolls" dataset](https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/), which consists of data on ~3M tweets from Twitter accounts that are known to be part of state-sponsored disinformation campaigns. This particular dataset was collected and coded by Darren Linvill and Patrick Warren, of Clemson University. It includes several variables that were hand coded by Linvill and Warren, the most important of which are classifications of accounts into different types. 

The dataset is stored in 13 different `csv` files. They are stored in a directory called `russian-troll-tweets`, which is inside the `data` directory.

In [3]:
!ls data/russian-troll-tweets

IRAhandle_tweets_10.csv  IRAhandle_tweets_2.csv  IRAhandle_tweets_7.csv
IRAhandle_tweets_11.csv  IRAhandle_tweets_3.csv  IRAhandle_tweets_8.csv
IRAhandle_tweets_12.csv  IRAhandle_tweets_4.csv  IRAhandle_tweets_9.csv
IRAhandle_tweets_13.csv  IRAhandle_tweets_5.csv  README.md
IRAhandle_tweets_1.csv	 IRAhandle_tweets_6.csv


Let's start by loading just one of the files. Later we will see how to read in all 13 and combine them into one large dataset. 

In [4]:
df = pd.read_csv('data/russian-troll-tweets/IRAhandle_tweets_1.csv')

By default, `pandas` assumes your data is encoded with `UTF-8`. If you see an encoding error, you can pass a different encoding to the reader through the `encoding` argument, such as `encoding='latin-1'`.
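
The sketch below shows one way such an encoding error might be handled. It builds a tiny CSV in memory, deliberately encoded as Latin-1, tries the default `UTF-8` first, and falls back to Latin-1 if decoding fails. The bytes, column names, and the `cities` variable are made up for illustration; with your own files you would need to know, or find out, the actual encoding.

```python
import io
import pandas as pd

# Toy CSV bytes encoded as Latin-1; the accented character is not valid UTF-8 here
raw_bytes = 'author,city\nanna,Montréal\n'.encode('latin-1')

try:
    # pandas tries UTF-8 by default, which fails on these bytes
    cities = pd.read_csv(io.BytesIO(raw_bytes))
except UnicodeDecodeError:
    # Fall back to the encoding we believe the data actually uses
    cities = pd.read_csv(io.BytesIO(raw_bytes), encoding='latin-1')

print(cities)
```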

Once we have our `dataframe`, we can use the `info()` method to see the name of each column, as well as its integer index and datatype. 

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243891 entries, 0 to 243890
Data columns (total 21 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   external_author_id  243891 non-null  int64 
 1   author              243891 non-null  object
 2   content             243891 non-null  object
 3   region              243853 non-null  object
 4   language            243891 non-null  object
 5   publish_date        243891 non-null  object
 6   harvested_date      243891 non-null  object
 7   following           243891 non-null  int64 
 8   followers           243891 non-null  int64 
 9   updates             243891 non-null  int64 
 10  post_type           154592 non-null  object
 11  account_type        243891 non-null  object
 12  retweet             243891 non-null  int64 
 13  account_category    243891 non-null  object
 14  new_june_2018       243891 non-null  int64 
 15  alt_external_id     243891 non-null  int64 
 16  tw

We now have a `dataframe` with 21 variables. The `dataframe` is organized as we would expect: with variables in the columns and observations in the rows. We can use the `.head()` method to preview the top $n$ rows of the dataset. 

In [6]:
df.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,906000000000000000,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,...,Right,0,RightTroll,0,905874659358453760,914580356430536707,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/914580356430...,,
1,906000000000000000,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,...,Right,0,RightTroll,0,905874659358453760,914621840496189440,http://twitter.com/905874659358453760/statuses...,https://twitter.com/damienwoody/status/9145685...,,
2,906000000000000000,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,10/1/2017 22:50,10/1/2017 22:51,1054,9637,255,...,Right,1,RightTroll,0,905874659358453760,914623490375979008,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/913231923715...,,
3,906000000000000000,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,English,10/1/2017 23:52,10/1/2017 23:52,1062,9642,256,...,Right,0,RightTroll,0,905874659358453760,914639143690555392,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/914639143690...,,
4,906000000000000000,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,English,10/1/2017 2:13,10/1/2017 2:13,1050,9645,246,...,Right,1,RightTroll,0,905874659358453760,914312219952861184,http://twitter.com/905874659358453760/statuses...,https://twitter.com/realDonaldTrump/status/914...,,
5,906000000000000000,10_GOP,"Dan Bongino: ""Nobody trolls liberals better th...",Unknown,English,10/1/2017 2:47,10/1/2017 2:47,1050,9644,247,...,Right,0,RightTroll,0,905874659358453760,914320835325853696,http://twitter.com/905874659358453760/statuses...,https://twitter.com/FoxNews/status/91423949678...,,
6,906000000000000000,10_GOP,🐝🐝🐝 https://t.co/MorL3AQW0z,Unknown,English,10/1/2017 2:48,10/1/2017 2:48,1050,9644,248,...,Right,1,RightTroll,0,905874659358453760,914321156466933760,http://twitter.com/905874659358453760/statuses...,https://twitter.com/Cernovich/status/914314644...,,
7,906000000000000000,10_GOP,'@SenatorMenendez @CarmenYulinCruz Doesn't mat...,Unknown,English,10/1/2017 2:52,10/1/2017 2:53,1050,9644,249,...,Right,0,RightTroll,0,905874659358453760,914322215537119234,http://twitter.com/905874659358453760/statuses...,,,
8,906000000000000000,10_GOP,"As much as I hate promoting CNN article, here ...",Unknown,English,10/1/2017 3:47,10/1/2017 3:47,1050,9646,250,...,Right,0,RightTroll,0,905874659358453760,914335818503933957,http://twitter.com/905874659358453760/statuses...,http://www.cnn.com/2017/09/27/us/puerto-rico-a...,,
9,906000000000000000,10_GOP,After the 'genocide' remark from San Juan Mayo...,Unknown,English,10/1/2017 3:51,10/1/2017 3:51,1050,9646,251,...,Right,0,RightTroll,0,905874659358453760,914336862730375170,http://twitter.com/905874659358453760/statuses...,,,


Alternatively, we could use the `.sample()` method to pull a random sample of $n$ observations, which can be helpful if we don't want the observations we preview to be from the top (`head`) or bottom (`tail`) of the dataset.

In [7]:
df.sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
18707,895000000000000000,ACAPARELLA,#acapa Planned Parenthood U.K. Now Pitching Ab...,Unknown,English,8/13/2017 20:54,8/13/2017 20:54,39,7,312,...,Right,0,RightTroll,0,894839911597191168,896837357101604864,http://twitter.com/894839911597191168/statuses...,https://twitter.com/ACaparella/status/89683735...,http://ift.tt/2uATUOO,
177976,2500690416,ANNIEPOSHES,#IGetDepressedWhen i open my paycheck https:/...,United States,English,8/31/2016 15:18,8/31/2016 15:18,1773,1718,2586,...,Hashtager,1,HashtagGamer,1,2500690416,771004192320258049,http://twitter.com/AnniePoshes/statuses/771004...,https://twitter.com/gonnarain/status/770998725...,,
25176,1652138929,ACEJINEV,https://t.co/SwMB9nfKLM #Reawakening #Movie #E...,United States,English,8/2/2017 10:59,8/2/2017 10:59,776,908,7085,...,Left,1,LeftTroll,0,1652138929,892701354929664000,http://twitter.com/1652138929/statuses/8927013...,https://youtu.be/woATgHEUgKo,,
78349,891000000000000000,ALIISTRR,#alis Look Who’s Trying to Flood Congress With...,United States,English,8/15/2017 16:55,8/15/2017 16:55,2922,513,1732,...,Right,0,RightTroll,0,891202510660259840,897501945832722432,http://twitter.com/891202510660259840/statuses...,https://twitter.com/aliistrr/status/8975019458...,http://ift.tt/2w7xEwa,
91446,890000000000000000,AMBERLINETR,#amb SHOCKING VIDEO : Antifa and ISIS are VIRT...,United States,English,8/15/2017 17:05,8/15/2017 17:06,2932,372,2354,...,Right,0,RightTroll,0,890429331373043712,897504537677713409,http://twitter.com/890429331373043712/statuses...,https://twitter.com/amberlinetr/status/8975045...,http://ift.tt/2wNaN6v,
139387,1690617488,ANATOLINEMCOV,Что король Иордании хочет обсудить с Путиным ...,United Arab Emirates,Russian,11/22/2015 17:35,11/22/2015 17:35,468,670,7199,...,Russian,1,NonEnglish,1,1690617488,668482902871642113,http://twitter.com/AnatoliNemcov/statuses/6684...,https://twitter.com/GazetaRu/status/6684710586...,http://www.gazeta.ru/politics/news/2015/11/22/...,
79691,2243468839,ALINALINKI_,Шокирующее видео момента падения боинга на Укр...,Unknown,Russian,7/20/2015 7:59,7/20/2015 7:59,1178,282,1608,...,Russian,1,NonEnglish,1,2243468839,623039498587381760,http://twitter.com/alinalinki_/statuses/623039...,http://boeing-is-back.livejournal.com/266128.html,,
114553,1679279490,AMELIEBALDWIN,Adam Schiff's eyes and body language bother me...,United States,English,3/21/2017 8:16,3/21/2017 8:16,2303,2753,33383,...,Right,1,RightTroll,0,1679279490,844100376022401024,http://twitter.com/1679279490/statuses/8441003...,,,
48478,2570250275,AIDEN7757,'@KeshaTedder @ZenRand @AFrikkinHashtag @Johns...,United States,English,7/27/2016 10:58,7/27/2016 10:58,1086,555,1103,...,Hashtager,1,HashtagGamer,0,2570250275,758255204127088640,http://twitter.com/Aiden7757/statuses/75825520...,,,
173484,839000000000000000,ANNAROMAN0,Presidenziali #Francia: #Fillon supera #Melenc...,Italy,Italian,4/19/2017 5:09,4/19/2017 5:09,564,96,2847,...,Italian,1,NonEnglish,0,838682742229504001,854562590344626176,http://twitter.com/838682742229504001/statuses...,http://bit.ly/2oS1CQA,,


To load up the full dataset, which is spread across 13 files, we can read in each `csv` file and concatenate them all into a single `dataframe`. Note that if your data is contained in a single file, this step would not be necessary. 

In [8]:
data_dir = os.listdir('data/russian-troll-tweets')
data_dir

['README.md',
 'IRAhandle_tweets_6.csv',
 'IRAhandle_tweets_10.csv',
 'IRAhandle_tweets_13.csv',
 'IRAhandle_tweets_11.csv',
 'IRAhandle_tweets_3.csv',
 'IRAhandle_tweets_12.csv',
 'IRAhandle_tweets_5.csv',
 '.ipynb_checkpoints',
 'IRAhandle_tweets_7.csv',
 'IRAhandle_tweets_9.csv',
 'IRAhandle_tweets_1.csv',
 'IRAhandle_tweets_8.csv',
 'IRAhandle_tweets_4.csv',
 'IRAhandle_tweets_2.csv']

In [9]:
files = [f for f in data_dir if 'csv' in f]
files 

['IRAhandle_tweets_6.csv',
 'IRAhandle_tweets_10.csv',
 'IRAhandle_tweets_13.csv',
 'IRAhandle_tweets_11.csv',
 'IRAhandle_tweets_3.csv',
 'IRAhandle_tweets_12.csv',
 'IRAhandle_tweets_5.csv',
 'IRAhandle_tweets_7.csv',
 'IRAhandle_tweets_9.csv',
 'IRAhandle_tweets_1.csv',
 'IRAhandle_tweets_8.csv',
 'IRAhandle_tweets_4.csv',
 'IRAhandle_tweets_2.csv']

We will overwrite the `df` created earlier. 

In [10]:
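# Build the full path to each csv file, read each one, and stack the results
csv_paths = [os.path.join('data/russian-troll-tweets', f) for f in files]
df = pd.concat(pd.read_csv(path, encoding='utf-8', low_memory=False) for path in csv_paths)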
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2946207 entries, 0 to 250519
Data columns (total 21 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   external_author_id  object
 1   author              object
 2   content             object
 3   region              object
 4   language            object
 5   publish_date        object
 6   harvested_date      object
 7   following           int64 
 8   followers           int64 
 9   updates             int64 
 10  post_type           object
 11  account_type        object
 12  retweet             int64 
 13  account_category    object
 14  new_june_2018       int64 
 15  alt_external_id     object
 16  tweet_id            int64 
 17  article_url         object
 18  tco1_step1          object
 19  tco2_step1          object
 20  tco3_step1          object
dtypes: int64(6), object(15)
memory usage: 494.5+ MB
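
Note that the summary above reports 2946207 entries but index labels that only run from 0 to 250519: `pd.concat()` keeps each file's original index labels by default, so labels repeat across files. The following sketch, using two invented toy frames (`part_a` and `part_b`), illustrates this behaviour and shows the `ignore_index=True` option, which builds a fresh index instead. Whether you want that depends on your analysis; it is shown here only as an option.

```python
import pandas as pd

# Two toy "files", each with its own default 0..n-1 index
part_a = pd.DataFrame({'author': ['a', 'b'], 'followers': [10, 20]})
part_b = pd.DataFrame({'author': ['c', 'd'], 'followers': [30, 40]})

# Default behaviour: original index labels are kept, so labels repeat
stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())       # [0, 1, 0, 1]

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
renumbered = pd.concat([part_a, part_b], ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2, 3]
```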


In this case, we have two datatypes in our `dataframe`: `object` and `int64`. `Pandas` uses `object` to refer to columns that contain `strings`, or which contain mixed types, such as `strings` and `integers`. In this case, they refer to `strings`. `int64` are integers. In addition to these two data types, `pandas` stores `floats` (`float64`), booleans (True or False), several specialized `datetime` data structures, and categorical variables.  
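
To make these data types concrete, here is a small sketch that builds a toy `DataFrame` with one column of each kind and prints the resulting `dtypes`. All column names and values are invented for illustration. The exact label shown for string and datetime columns can vary between `pandas` versions, so the comments only describe the general kind of data.

```python
import pandas as pd

toy = pd.DataFrame({
    'author': ['a', 'b', 'c'],                      # strings
    'followers': [10, 250, 3],                      # integers
    'share': [0.1, 0.7, 0.2],                       # floats
    'retweet': [True, False, True],                 # booleans
    'publish_date': pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-03']),  # datetimes
    'account_type': pd.Categorical(['Right', 'Left', 'Right']),                  # categorical
})

# One dtype per column, stored at the level of the column
print(toy.dtypes)
```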

One further thing to note about this dataset: **each row is a tweet from a specific account, but some of the variables describe attributes of the tweeting accounts, not of the tweet itself**. For example, `followers` describes the number of followers that the account had at the time it sent the tweet. This makes sense, because tweets don't have followers, but accounts do. We need to keep this in mind when working with this dataset. 
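
A small sketch of why this matters: if we average an account-level variable over tweets, accounts that tweet a lot are counted many times. The toy data below is invented, and taking each account's maximum follower count is just one possible choice for collapsing to the account level, not the approach used in this course.

```python
import pandas as pd

# Toy tweet-level data: one row per tweet, invented for illustration
tweets = pd.DataFrame({
    'author': ['a', 'a', 'a', 'b', 'b'],
    'followers': [100, 105, 110, 20, 20],
})

# Naive mean over tweets: prolific accounts are counted once per tweet
tweet_level_mean = tweets['followers'].mean()

# Account-level view: one value per account (here, the largest observed count)
per_account = tweets.groupby('author')['followers'].max()
account_level_mean = per_account.mean()

print(tweet_level_mean, account_level_mean)   # 71.0 65.0
```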

## Understanding `Pandas` Data Structures 

Now that we have a `dataframe` loaded into memory, we can move on to some interesting data analyses. But first, let's devote a bit of time to clarifying `pandas` data structures. 

### Background Knowledge &mdash; Dynamic Typing [<i class="fa fa-forward"></i>](#skip_dynamic)

> Note: feel free to use the [<i class="fa fa-forward"></i>](#skip_dynamic) button above to temporarily [skip](#skip_dynamic) over this "background knowledge" section if you are feeling overwhelmed with new information. It is useful to know, but it is not *essential* knowledge for using `pandas` to analyze data. You will not lose much if you come back to this at some point in the future, when you are more comfortable with basic `pandas` data structures and operations. 

First, some background knowledge. Python is a dynamically typed language. What that means is that you don't need to constantly tell Python what kind of object something is. For example, if you add two numbers together

In [11]:
42 + 8

50

it is not necessary to tell Python that `42` and `8` are integers. Instead, Python stores that metadata in each object. 

When we store data in a list, every element in the list is a Python object in its own right, containing not only the data itself (e.g. `42`), but also information about the **type** of data that it is, which in this case is `int`. This is enormously useful in many cases, because we can store objects of different types in the same `list`.

In [12]:
some_data = [42, 8.0, 'a string']
print(some_data)

[42, 8.0, 'a string']


In this example, each element in `some_data` also contains information about the type of object it is. As previously mentioned, this is enormously helpful in some contexts, but dramatically slows down computation in other contexts. Data analysis is one example of where, depending on what you are trying to do, dynamic typing can slow things down rather a lot. 
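
We can check this directly by asking each element of the list for its type. This is only a small illustration using the built-in `type()` function; the `element_types` name is introduced here for convenience.

```python
some_data = [42, 8.0, 'a string']

# Each element is a full Python object that knows its own type
for item in some_data:
    print(repr(item), type(item).__name__)

# Collect the type names in a list for inspection
element_types = [type(item).__name__ for item in some_data]
print(element_types)   # ['int', 'float', 'str']
```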

When you are analyzing data, you are almost always working with some collection of elements that are all of the same type, such as integers, floats, strings, or Boolean values. For example, you can't compute the mean and standard deviation of a collection of elements that include both integers and strings. So it follows that data analysis can be made more efficient by working on data structures where information about data types is stored at the level of the collection itself rather than each element in the collection, *provided the data is all of the same type*. 

One of the main tools for doing this in `Python` is the `numpy` package, which is more or less the foundation of all data analysis in `Python`, whether you explicitly use it or not. `numpy` provides data structures for working with `arrays` of data that are a bit like lists except that all elements are of the same type, information about that type is stored at the level of the `array` itself, and each element in the `array` has an explicit integer index. `arrays` can be one dimensional vectors or multi-dimensional matrices. 
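
As a brief sketch of these properties, the example below creates two small `numpy` arrays from invented values. The integer type is requested explicitly with `dtype=np.int64` so that the result does not depend on the platform.

```python
import numpy as np

# The type is stored once, on the array, not on each element
followers = np.array([42, 8, 15], dtype=np.int64)
print(followers.dtype)        # int64

# Mixing integers and floats: numpy converts everything to one common type
mixed = np.array([42, 8.0])
print(mixed.dtype)            # float64

# Elements have explicit integer positions, and operations apply to the whole array
print(followers[0])           # 42
print(followers.mean())
```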

Further discussion of `numpy` is beyond the scope of this class. For our purposes here, what you need to know is that `pandas` builds on top of `numpy` and offers an additional set of data structures and methods that are designed explicitly to meet the needs of researchers working with real-world empirical data. Like `numpy`, `pandas` is designed to make scientific computing more efficient, but as we will learn below there are some common pitfalls that, if you are not careful, can actually make working with `pandas` slow and inefficient. 


<a id='skip_dynamic'></a>
## Back to Essential `Pandas`: `Series` and `index`

Each column in a `dataframe` is an object called a `series`. A `series` is a one-dimensional object, such as a vector of numbers. However, that vector is associated with an `index`, which is a vector, or array, of labels. 

For example, the column `followers` in our Russian troll `dataframe` is a `series` of integers (the number of followers the account had when it sent the tweet) and their `index` labels. 

In [13]:
num_followers = df['followers']
type(num_followers)

pandas.core.series.Series

Below, we pull a random sample of 25 observations from the `series`. The value on the left is the index label for the observation, and the number on the right is the actual data value (number of followers). The index values are in order in the actual `series`, but they are out of order here because we pulled a random sample. 

In [14]:
num_followers.sample(25)

223481     2755
180515     3047
216733    18054
80844      2407
70988       351
131168    40241
115936    17842
116577    18600
168820      819
107048      116
1097      61839
137667      253
168398      751
95046       105
101146       13
98621      7697
95206     21670
99876       131
149622      238
194434     1798
182687     2599
31449       118
183122     2580
93247        87
42423       213
Name: followers, dtype: int64
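
The two parts of a `series`, the data and the labels, can also be inspected separately through the `.values` and `.index` attributes. The sketch below uses a toy `series` built from three of the follower counts shown above; the `toy_series` name is an assumption made for this illustration.

```python
import pandas as pd

# A small Series with the default integer index
toy_series = pd.Series([2755, 3047, 18054], name='followers')

print(toy_series.values)   # the data, as a numpy array
print(toy_series.index)    # the labels that go with the data
```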

In most cases, the default `index` for a `series` or `dataframe` is an immutable vector of integers:

In [15]:
num_followers.index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            250510, 250511, 250512, 250513, 250514, 250515, 250516, 250517,
            250518, 250519],
           dtype='int64', length=2946207)

In some cases, such as time series analysis, the `index` might default to a `DatetimeIndex` or a `PeriodIndex`, but we will not consider those in this course. If you are working with time series data, the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) provides explanations of how to use these types of `indices`.

We can easily modify an `index` so that it is made up of some other type of vector instead, such as a vector of `strings`. Surprisingly, `index` values do not need to be unique (technically, they form a `multiset`, or a `set` that is allowed to have repeated elements). This enables us to do some powerful things, but most of the time you should avoid manually changing indexes. 

We can use the `index` to retrieve specific values from a `series` much as we would if we were selecting an element from a `list`, `tuple`, or `array`.
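
Here is a minimal sketch of both ideas, using invented values. `.loc` looks values up by index label and `.iloc` by integer position; the second `series` deliberately repeats a label to show what a non-unique `index` does to a lookup.

```python
import pandas as pd

# String labels instead of the default integers
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(labeled.loc['b'])     # label-based lookup -> 20
print(labeled.iloc[0])      # position-based lookup -> 10

# Index labels may repeat; a repeated label returns every matching value
repeated = pd.Series([10, 20, 30], index=['a', 'b', 'a'])
print(repeated.loc['a'])            # a Series holding 10 and 30
print(repeated.index.is_unique)     # False
```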

### Operations on `Series`: Descriptive Statistics

As we will soon see, there are a number of operations we can perform on `Series`, such as simple descriptive statistics like mean, median, mode, and standard deviation.

In [16]:
print('Median ', num_followers.median())
print('Mean ', num_followers.mean())
print('Standard Deviation ', num_followers.std())

Median  1274.0
Mean  7055.265491868019
Standard Deviation  14635.939344602943


`Pandas` is built on top of `numpy`, so `numpy` functions work on `Series` objects and on the values returned by `Series` methods. For example, we can use the `np.round()` function from `numpy` to round these descriptives to three decimal places. 

In [17]:
print('Median ', np.round(num_followers.median(), 3))
print('Mean ', np.round(num_followers.mean(), 3))
print('Standard Deviation ', np.round(num_followers.std(), 3))

Median  1274.0
Mean  7055.265
Standard Deviation  14635.939


We can also `count` the number of non-missing observations in a `series`

In [18]:
num_followers.count()

2946207

or get an overview of multiple descriptives at once:

In [19]:
num_followers.describe()

count    2.946207e+06
mean     7.055265e+03
std      1.463594e+04
min     -1.000000e+00
25%      3.220000e+02
50%      1.274000e+03
75%      1.085300e+04
max      2.512760e+05
Name: followers, dtype: float64

If our series is categorical, we can also easily compute useful information such as the number of unique categories, the size of each category, and so on. For example, let's look at the `account_type` `series`.

In [20]:
atype = df['account_type']

In [21]:
atype.unique()

array(['Right', 'Russian', '?', 'Koch', 'Hashtager', 'Commercial', 'Left',
       'local', 'Arabic', 'news', 'German', 'Spanish', 'French',
       'Italian', 'Ebola ', 'Portuguese', 'Uzbek', 'Ukranian',
       'ZAPOROSHIA'], dtype=object)

In [22]:
atype.value_counts()

Right         711668
Russian       704917
local         459220
Left          427141
Hashtager     241786
news          139006
Commercial    121904
German         91511
Italian        15680
?              13539
Koch           10894
Arabic          6228
Spanish         1226
French          1117
ZAPOROSHIA       175
Portuguese       118
Ebola             71
Ukranian           4
Uzbek              2
Name: account_type, dtype: int64
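
A few related methods are sketched below on a toy `series` of invented category labels: `.nunique()` counts the distinct categories, `value_counts(normalize=True)` reports proportions instead of raw counts, and `.mode()` returns the most frequent value or values.

```python
import pandas as pd

# Toy categorical data, invented for illustration
toy_types = pd.Series(['Right', 'Left', 'Right', 'news', 'Right', 'Left'])

print(toy_types.nunique())                       # number of distinct categories -> 3

shares = toy_types.value_counts(normalize=True)  # share of rows in each category
print(shares)

print(toy_types.mode())                          # most frequent value(s), returned as a Series
```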

Later, we will consider some summary statistics for pairs of `series`, such as computing correlations and covariance. 

## DataFrames

We already have our `DataFrame` loaded into memory (as `df`), but so far all we have used it for is pulling out individual `series`. This is easy to do in part because `DataFrames` are themselves just collections of `Series` that are aligned on the same `index` values. In other words, both `Series` we worked with previously -- `atype` and `num_followers` -- have their own `indexes` when we work with them as `Series`, but in a `DataFrame`, they share an index. `DataFrames` are organized the way we would expect: with variables in the columns and observations in the rows. We can use the `.head()` method to preview the top $n$ rows of the dataset. 

In [23]:
df.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,2535818742,HAPPKENDRAHAPPY,Bosh situation hasn't changed since Feb. Heat ...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2285,...,Right,1,RightTroll,0,2535818742,779366817752084481,http://twitter.com/happkendrahappy/statuses/77...,,,
1,2535818742,HAPPKENDRAHAPPY,Youre an IDIOT! Now you say 99% when before yo...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2284,...,Right,1,RightTroll,0,2535818742,779366807446642688,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/erecordscity/status/779354...,,
2,2535818742,HAPPKENDRAHAPPY,Charlotte-Mecklenburg Fraternal Order of Polic...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2286,...,Right,1,RightTroll,0,2535818742,779366828585910272,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
3,2535818742,HAPPKENDRAHAPPY,Theodore Roosevelt's son Quentin and his frien...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2290,...,Right,1,RightTroll,0,2535818742,779367073998835712,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/HistoryInPics/status/73882...,,
4,2535818742,HAPPKENDRAHAPPY,.@flashfire451: Are there more cures than term...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2289,...,Right,1,RightTroll,0,2535818742,779367063479521281,http://twitter.com/happkendrahappy/statuses/77...,,,
5,2535818742,HAPPKENDRAHAPPY,Suspected Illegal Alien Marijuana Farmers Held...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2288,...,Right,1,RightTroll,0,2535818742,779367053077712896,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
6,2535818742,HAPPKENDRAHAPPY,This picture is 100% BOGUS! Just watch MSNBC &...,United States,English,9/23/2016 17:10,9/23/2016 17:10,1311,1688,2291,...,Right,1,RightTroll,0,2535818742,779367286385831937,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/micheleredding2/status/779...,,
7,2535818742,HAPPKENDRAHAPPY,Proud to be part-Polish! Poland Initially Appr...,United States,English,9/23/2016 17:11,9/23/2016 17:11,1311,1688,2293,...,Right,1,RightTroll,0,2535818742,779367574454800384,http://twitter.com/happkendrahappy/statuses/77...,http://www.lifenews.com/2016/09/23/poland-pois...,,
8,2535818742,HAPPKENDRAHAPPY,in your case Hillary if you do when you're go...,United States,English,9/23/2016 17:11,9/23/2016 17:12,1311,1688,2292,...,Right,1,RightTroll,0,2535818742,779367562782117889,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/cboutet11/status/778718668...,,
9,2535818742,HAPPKENDRAHAPPY,#ThingsMoreTrustedThanHillary Any fairy tale book,United States,English,9/27/2016 1:35,9/27/2016 1:37,1311,1686,2293,...,Right,0,RightTroll,0,2535818742,780581631342080001,http://twitter.com/happkendrahappy/statuses/78...,,,


As before, we could instead use the `.sample()` method to pull a random sample of $n$ observations, this time from the full dataset.

In [24]:
df.sample(5)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
92297,2543205364,OLGAMOROZOVAMSK,Николя Саркози призвал западные страны не изол...,Unknown,Russian,10/29/2015 8:53,10/29/2015 8:53,188,279,2252,...,Russian,1,NonEnglish,1,2543205364,659654333777211392,http://twitter.com/olgamorozovamsk/statuses/65...,http://tass.ru/mezhdunarodnaya-panorama/2387920,,
231897,2260338140,POLITICS_T0DAY,https://t.co/uUC6n0HboL,United States,Russian,12/21/2016 17:15,12/21/2016 17:16,142,1035,21522,...,Russian,0,NonEnglish,0,2260338140,811621212946399234,http://twitter.com/2260338140/statuses/8116212...,https://twitter.com/politics_t0day/status/8116...,,
102752,2570574680,RIAFANRU,Соцсети США накрыло истеричной русофобской вол...,Belarus,Russian,7/29/2017 6:00,7/29/2017 6:00,6504,12896,88749,...,Russian,0,NonEnglish,0,2570574680,891176559188611072,http://twitter.com/2570574680/statuses/8911765...,https://twitter.com/riafanru/status/8911765591...,https://riafan.ru/888461-socseti-ssha-nakrylo-...,
107877,3254273689,FINDDIET,http://t.co/s148t6FTK7 iLl @M4RK_RTR Luis @lui...,United States,French,8/13/2015 19:38,8/13/2015 19:38,4,368,32187,...,Commercial,1,Commercial,1,3254273689,631912609214627840,http://twitter.com/FindDiet/statuses/631912609...,https://twitter.com/dre_galvan11/status/631912...,http://WWW.LOSEFATTIPS.PW/TIPS/TO-WORKOUT-OR-G...,
28537,508761973,NOVOSTISPB,Мы сами можем создавать праздник себе и окружа...,Russian Federation,Russian,12/14/2016 11:05,12/14/2016 11:05,8361,106821,39279,...,Russian,0,NonEnglish,1,508761973,808991404169031680,http://twitter.com/508761973/statuses/80899140...,https://twitter.com/NovostiSPb/status/80899140...,,
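
Returning to the idea that a `DataFrame` is a collection of `Series` aligned on a shared `index`, the sketch below builds a `DataFrame` from two toy `Series` whose labels only partly overlap. The values are invented; the point is only that `pandas` matches rows by label and fills in missing values where a label is absent from one of the `Series`.

```python
import pandas as pd

# Two toy Series whose index labels only partly overlap
followers = pd.Series([10, 20, 30], index=[0, 1, 2])
following = pd.Series([5, 15], index=[1, 2])

# Building a DataFrame aligns the Series on their index labels;
# labels missing from one Series are filled with NaN
aligned = pd.DataFrame({'followers': followers, 'following': following})
print(aligned)
```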


When working with a `dataframe`, we can select subsets of data by selecting columns or filtering rows. Let's look at selecting columns first. 

### Selecting Columns 

Earlier, we saw how we could select a single column by specifying the name of the `dataframe` followed by the name of the `series` inside square brackets and straight quotes. 

In [25]:
followers = df['followers']
followers.sample(10)

18156      1967
212919    27250
74288       113
58185       866
195029      105
104790    12908
79910       218
56212      1955
111258    17029
98135       672
Name: followers, dtype: int64

We can select multiple columns by passing a list of column names. Whereas the result of the previous selection was a `Series` (because we only pulled one column), selecting multiple columns will return a `DataFrame` containing only the requested columns. 

In [26]:
ff = df[['followers', 'following']]
ff.sample(10)

Unnamed: 0,followers,following
63962,12169,9310
88613,797,746
196593,1265,32
113992,2742,2304
200096,7178,6744
64385,526,699
122599,974,888
212413,284,549
29798,83,946
43896,21939,5310


This kind of subsetting can be very helpful when, for example, you are working with datasets that have a lot of columns, only some of which are required for your analysis. 
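
To make the `Series` versus `DataFrame` distinction concrete, here is a minimal sketch on a tiny made-up `DataFrame`. The `toy` data is hypothetical and only stands in for the trolls dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the course dataset
toy = pd.DataFrame({
    'author': ['A', 'B', 'C'],
    'followers': [10, 2500, 40],
    'following': [5, 100, 80],
})

one_col = toy['followers']                   # a single column name gives a Series
one_col_df = toy[['followers']]              # a one-item list still gives a DataFrame
two_cols = toy[['followers', 'following']]   # a list of names gives a DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(list(two_cols.columns))
```

The double brackets are not special syntax: the outer pair does the selecting and the inner pair is an ordinary Python list.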

### Filtering Rows 

It is also sometimes necessary to filter rows. There are a variety of ways to do this, including slices (e.g. all observations between index $i_i$ and index $i_j$). In a data analysis context, most of the row filtering you will do is likely to be based on some sort of explicit condition, such as "give me all the observations with at least 1,000 followers." Most likely, you will only filter rows based on slices if you are selecting the first $n$ rows of a `DataFrame` that has been sorted by the values of some `Series`. We will consider this case later. 

In [27]:
df[df['followers'] >= 1000].sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
156606,1674083671,GAB1ALDANA,#WhatIHateIn5Words People who exclusively wear...,United States,English,7/27/2016 10:06,7/27/2016 10:06,2629,1914,2035,...,Hashtager,1,HashtagGamer,0,1674083671,758242114656309248,http://twitter.com/Gab1Aldana/statuses/7582421...,,,
255290,2580516159,JMSCOXXX,Britney Spears is such an amazing woman and mo...,United States,English,3/24/2017 9:45,3/24/2017 9:46,3661,4403,5025,...,Hashtager,0,HashtagGamer,0,2580516159,845209997482967045,http://twitter.com/2580516159/statuses/8452099...,,,
213908,2535818742,HAPPKENDRAHAPPY,"no, he's dead and i read even the family is d...",United States,English,12/15/2016 21:55,12/15/2016 21:55,1771,1803,3728,...,Right,1,RightTroll,0,2535818742,809517206379884545,http://twitter.com/2535818742/statuses/8095172...,,,
165977,743167000000000000,COVFEFENATIONUS,Roy Moore should take advantage of being calle...,United States,English,11/15/2017 17:30,11/15/2017 17:31,247,2188,147135,...,Right,1,RightTroll,1,743166519157227520,930850603135148033,http://twitter.com/743166519157227520/statuses...,,,
176935,743167000000000000,COVFEFENATIONUS,'@realDonaldTrump: A Rising @China Under Xi Pr...,United States,English,11/8/2017 21:40,11/8/2017 21:40,247,2066,141588,...,Right,1,RightTroll,1,743166519157227520,928376612906659840,http://twitter.com/743166519157227520/statuses...,http://bit.ly/2zefAiA,,
154419,4224729994,TEN_GOP,HEARTBREAKING: The son of fallen Detroit polic...,United States,English,9/23/2016 19:29,9/23/2016 19:29,29578,33305,4130,...,Right,0,RightTroll,0,4224729994,779402374582591488,http://twitter.com/TEN_GOP/statuses/7794023745...,https://twitter.com/TEN_GOP/status/77940237458...,,
13206,3071479646,BALTIMORE0NLINE,Police investigate air bag thefts in Howard Co...,United States,English,4/14/2017 20:28,4/14/2017 20:28,7493,6856,15374,...,local,0,NewsFeed,0,3071479646,852981834396905473,http://twitter.com/3071479646/statuses/8529818...,https://twitter.com/Baltimore0nline/status/852...,http://www.wbaltv.com/article/police-investiga...,
18334,2601235821,TODAYPITTSBURGH,Penn State “Thon” Raises Nearly $10 Million #...,United States,English,2/21/2016 21:28,2/21/2016 21:28,7580,16170,24807,...,local,0,NewsFeed,0,2601235821,701518772817915904,http://twitter.com/TodayPittsburgh/statuses/70...,,,
66665,1716561367,IIDDAAMARKS,#NowPlaying CEO/COMEDIAN/ARTIST Producer 9-0 i...,United States,English,1/22/2017 7:44,1/22/2017 7:44,909,1118,4348,...,Left,1,LeftTroll,0,1716561367,823073853752672256,http://twitter.com/1716561367/statuses/8230738...,http://tidal.com/artist/6245931,,
41913,2571997365,TONEPORTER,#ICelebrateTrumpWith a bucnh of friends who sa...,United States,English,11/9/2016 16:25,11/9/2016 16:25,1899,1578,2330,...,Hashtager,0,HashtagGamer,0,2571997365,796388126965006336,http://twitter.com/2571997365/statuses/7963881...,,,


Alternatively, we could filter based on membership in some category, such as being a `RightTroll` or `LeftTroll` account. `RightTroll` and `LeftTroll` are values in the `account_category` `series`. Let's get the rows for `RightTroll` accounts. 

In [28]:
df[df['account_category'] == 'RightTroll'].sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
181833,789000000000000000,WORLDNEWSPOLI,Getting up to speed on the Wells Fargo sales s...,Unknown,English,4/10/2017 19:22,4/10/2017 19:22,4308,3023,22211,...,Right,0,RightTroll,0,789266125485998080,851515833209597952,http://twitter.com/789266125485998080/statuses...,http://www.washingtontimes.com/news/2017/apr/1...,,
49021,898000000000000000,CHARMEESTRS,BREAKING: At Least 5 Dead as 8.1 Magnitude Ear...,Unknown,English,9/8/2017 16:53,9/8/2017 16:53,4858,1277,626,...,Right,0,RightTroll,0,898452282181730305,906198783935090690,http://twitter.com/898452282181730305/statuses...,http://zpr.io/PQieZ,,
56037,1690487623,MICHELLEARRY,You gotta put up a message at https://t.co/yhH...,United States,English,1/22/2017 20:20,1/22/2017 20:20,3232,3254,2886,...,Right,1,RightTroll,0,1690487623,823263992441401345,http://twitter.com/1690487623/statuses/8232639...,https://twitter.com/Trump_Monument/status/8229...,http://TrumpUSAforever.com,
226755,1676481360,EMILEEWAREN,How many terrorist did that to Paris? 8. out o...,United States,English,11/17/2015 8:39,11/17/2015 8:39,550,329,1022,...,Right,1,RightTroll,0,1676481360,666536035829043200,http://twitter.com/EmileeWaren/statuses/666536...,https://twitter.com/YMcglaun/status/6662827410...,,
33156,892000000000000000,CHAASNTR,"RT Nonna_Ni: If he runs looking like that, he ...",Unknown,English,8/2/2017 3:49,8/2/2017 3:49,977,592,728,...,Right,1,RightTroll,0,891902187130966017,892593205665042432,http://twitter.com/891902187130966017/statuses...,https://twitter.com/politicalHEDGE/status/8924...,,
189194,870000000000000000,EISSYT56T,"Ronna Romney McDaniel, RNC chair: RNC will def...",Unknown,English,6/11/2017 20:23,6/11/2017 20:23,0,11,494,...,Right,0,RightTroll,0,870497148365754368,873999066295787520,http://twitter.com/870497148365754368/statuses...,https://twitter.com/EissyT56T/status/873999066...,http://ceesty.com/qKa4KI,
136245,895000000000000000,ANAALESSIS,RT BenjaminZand: China says it'll side with No...,Unknown,English,8/11/2017 14:44,8/11/2017 14:44,21,1,142,...,Right,0,RightTroll,0,894798854717140992,896019406203101185,http://twitter.com/894798854717140992/statuses...,https://www.standard.co.uk/news/world/china-wa...,,
67452,892000000000000000,DANISSTRS,RT AgentSoulful007: Keyword here is Adults. Th...,Unknown,English,8/17/2017 21:27,8/17/2017 21:27,2563,790,3229,...,Right,0,RightTroll,0,891930470212141056,898295168117252096,http://twitter.com/891930470212141056/statuses...,https://twitter.com/i/web/status/8982486780743...,,
53277,1671234620,HYDDROX,IT'S TIME TO ABOLISH THE DEPT OF EDUCATION htt...,United States,English,3/31/2017 15:59,3/31/2017 15:59,2551,2241,18226,...,Right,1,RightTroll,0,1671234620,847840723038871552,http://twitter.com/1671234620/statuses/8478407...,https://twitter.com/Cutiepi2u/status/847831132...,,
210019,2912754262,PIGEONTODAY,'@FoxNews They needed that donkey statue to im...,United States,English,10/24/2015 7:51,10/24/2015 7:51,10292,18174,9404,...,Right,0,RightTroll,0,2912754262,657826698520633344,http://twitter.com/PigeonToday/statuses/657826...,,,


We are left with a subset of 711,668 rows (check yourself: `len(df[df['account_category'] == 'RightTroll'])`), each of which is a tweet from an account classified as a `RightTroll`. 
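
Both filters above work the same way: the comparison produces a boolean `Series` (often called a mask), and the `DataFrame` keeps the rows where the mask is `True`. The sketch below, on hypothetical `toy` data, shows the mask on its own and two common extensions: combining conditions, and testing membership in several categories with `isin()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the course dataset
toy = pd.DataFrame({
    'author': ['A', 'B', 'C', 'D'],
    'followers': [10, 2500, 40, 1200],
    'account_category': ['RightTroll', 'LeftTroll', 'NewsFeed', 'RightTroll'],
})

# A comparison on a Series returns a boolean Series (the mask)
mask = toy['followers'] >= 1000
popular = toy[mask]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
popular_right = toy[(toy['followers'] >= 1000) & (toy['account_category'] == 'RightTroll')]

# isin() checks membership in several categories at once
trolls = toy[toy['account_category'].isin(['RightTroll', 'LeftTroll'])]

print(mask.tolist())
print(popular_right['author'].tolist())
print(trolls['author'].tolist())
```

The parentheses around each condition matter, because `&` and `|` bind more tightly than comparisons such as `>=` in Python.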

### Removing Duplicates
One special case of filtering is removing duplicate rows from a `DataFrame`. This is often required when multiple rows refer to the same real-world entity and certain values remain constant across all of that entity's rows. 

For example, we may be interested in counting the number of accounts related to each region in our Russian Trolls dataset. We would start by selecting the `region` and `author` columns.

In [162]:
author_region = df[['author', 'region']]
author_region.head(10)

Unnamed: 0,author,region
0,HAPPKENDRAHAPPY,United States
1,HAPPKENDRAHAPPY,United States
2,HAPPKENDRAHAPPY,United States
3,HAPPKENDRAHAPPY,United States
4,HAPPKENDRAHAPPY,United States
5,HAPPKENDRAHAPPY,United States
6,HAPPKENDRAHAPPY,United States
7,HAPPKENDRAHAPPY,United States
8,HAPPKENDRAHAPPY,United States
9,HAPPKENDRAHAPPY,United States


Some authors have multiple rows in this `DataFrame` since they have authored multiple tweets. If we were to count how often each region appears in this dataset, we would over-estimate regions with more prolific tweeters. 

Instead, we will de-duplicate the `DataFrame`. 

In [163]:
author_region.drop_duplicates()

Unnamed: 0,author,region
0,HAPPKENDRAHAPPY,United States
67,HAPPYDAAAYYY,Unknown
72,HARERETRT,Unknown
78,HARKOVLIVE,Unknown
220,HARRYLEVVIS,United States
...,...,...
247502,CARLOSSAMANOS,United States
247548,CARLOS_HNES,United States
247613,CARMELMELLER,United States
247619,CARREDTRT,United States


In this case, de-duplication works on the entire row, ignoring the Index. However, if an author tweeted multiple times from different regions, we might see that author continue to appear multiple times in the dataset. This is because even though the `author` column has duplicates, the combinations of author and region would be unique. 

If we want to ensure each author is only included in the dataset once, we can drop duplicates based on a subset of columns. 

In [164]:
author_region.drop_duplicates(subset='author')

Unnamed: 0,author,region
0,HAPPKENDRAHAPPY,United States
67,HAPPYDAAAYYY,Unknown
72,HARERETRT,Unknown
78,HARKOVLIVE,Unknown
220,HARRYLEVVIS,United States
...,...,...
244957,CARLLTHERITR,Unknown
247502,CARLOSSAMANOS,United States
247548,CARLOS_HNES,United States
247613,CARMELMELLER,United States


In the cell below, use `drop_duplicates` on `df` to determine how many tweets in our dataset have duplicate content. 

In [166]:
# Your Answer Here

### Adding New Columns Using Transformations
Often, we need to add new columns to our `DataFrame` based on values in other columns.   

In [29]:
# To save our computers we will use a subset
small_df = df.sample(1000)

Sometimes, these new columns are transformations of a single column that already exists in the `DataFrame`.    

For example, we can create a new `empty_tweet` column. This column will be `True` when the `content` column is empty and `False` otherwise. 

In [30]:
small_df['empty_tweet'] = small_df['content'].isna()

We can also implement more complex transformations, such as those defined in custom functions. 

For example, the code below uses a custom function to extract the number of hashtags used in a tweet. 

In [35]:
def num_hashtags(row):
    tweet = row['content']
    try:
        num = tweet.count('#')
        return num
    except AttributeError:
        return 0

small_df['num_hashtags'] = small_df.apply(num_hashtags, axis=1)
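
For simple text operations like this one, pandas also offers string methods under the `.str` accessor. The sketch below, on hypothetical `toy` data, compares the custom function above with `str.count()`. It is an alternative, not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one missing tweet
toy = pd.DataFrame({'content': ['#a and #b', 'no tags', np.nan]})

def num_hashtags(row):
    tweet = row['content']
    try:
        return tweet.count('#')
    except AttributeError:
        # Missing content is a float NaN, which has no .count() method
        return 0

via_apply = toy.apply(num_hashtags, axis=1)

# str.count() leaves missing values missing, so fill them with 0 afterwards
via_str = toy['content'].str.count('#').fillna(0).astype(int)

print(via_apply.tolist())
print(via_str.tolist())
```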

In other cases, we will want to use multiple columns to create a new column.    

For example, we may want to calculate the follower-to-following ratio for accounts on Twitter. 

In [36]:
small_df['followers_following_ratio'] = small_df['followers'] / small_df['following']
small_df.sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1,empty_tweet,num_hashtags,followers_following_ratio
56483,3083086600,ALDRICH420,China Cuts Dollar Weight In FX Basket In Despe...,United States,English,12/29/2016 14:49,12/29/2016 14:49,1650,1532,1654,...,0,3083086600,814483447976787968,http://twitter.com/3083086600/statuses/8144834...,http://www.zerohedge.com/news/2016-12-29/china...,,,False,0,0.928485
173310,2533001646,JASPER_FLY,"#PoliceAMovie Tinker, Tailor, Soldier, Spy an...",United States,Romanian,7/18/2016 13:37,7/18/2016 13:37,1674,621,1971,...,0,2533001646,755033807858782208,http://twitter.com/Jasper_Fly/statuses/7550338...,,,,False,1,0.370968
153023,1710804738,COOKNCOOKS,Who wants to see @JamesOKeefeIII #debate face-...,United States,English,10/20/2016 8:10,10/20/2016 8:10,1456,1455,2142,...,0,1710804738,789015847247474688,http://twitter.com/CooknCooks/statuses/7890158...,,,,False,3,0.999313
83530,1622690647,VALYAMOOR,Депутат Госдумы просит прекратить авиасообщени...,Russian Federation,Russian,11/24/2015 12:31,11/24/2015 12:31,217,340,2456,...,1,1622690647,669131306815651841,http://twitter.com/ValyaMoor/statuses/66913130...,https://twitter.com/rianru/status/669119381503...,http://ria.ru/politics/20151124/1327504772.html,,False,0,1.56682
217505,2882331822,JENN_ABRAMS,Don't forget about the #DemDebate tonight! Who...,United States,English,11/14/2015 17:33,11/14/2015 17:33,11286,35382,10734,...,0,2882331822,665583382240235521,http://twitter.com/Jenn_Abrams/statuses/665583...,,,,False,1,3.135035
173265,2530830345,NEWORLEANSON,9 amazing things Drew Brees did in the 2015 se...,United States,English,1/4/2016 19:54,1/4/2016 19:54,16695,26016,36977,...,0,2530830345,684100550334484480,http://twitter.com/NewOrleansON/statuses/68410...,,,,False,1,1.558311
146939,4224729994,TEN_GOP,President Trump on fighting terrorism: “Americ...,United States,English,2/15/2017 18:00,2/15/2017 18:01,74054,82604,7207,...,0,4224729994,831926262486728704,http://twitter.com/4224729994/statuses/8319262...,https://twitter.com/TEN_GOP/status/83192626248...,,,False,1,1.115456
202793,1651693646,CYNTHIAMHUNTER,Clinton Foundation admits missteps in donor di...,United States,English,4/26/2015 21:50,4/26/2015 21:50,354,180,494,...,0,1651693646,592445568443674626,http://twitter.com/CynthiaMHunter/statuses/592...,,,,False,1,0.508475
162330,2671070290,PATRIOTBLAKE,"For God's sake, Wake up Black America. https:/...",United States,English,9/19/2016 18:24,9/19/2016 18:24,2305,2012,1685,...,0,2671070290,777936320555606016,http://twitter.com/PatriotBlake/statuses/77793...,https://twitter.com/SheriffClarke/status/76962...,,,False,0,0.872885
14373,2587100717,JUDELAMBERTUSA,Parents Outraged After School Budget Cuts Due ...,United States,English,3/28/2017 9:01,3/28/2017 9:01,1948,1779,3648,...,0,2587100717,846648429338116096,http://twitter.com/2587100717/statuses/8466484...,http://dld.bz/f8fEQ,,,False,0,0.913244


Once again, we can use a custom function to transform multiple columns to create one new column.   

In the cell below, use a custom function and the `apply()` method to create a new column called `more_followers` from the `followers` and `following` columns. This column should be `True` if an account has more followers than following, and `False` otherwise.

In [37]:
# Your Answer Here

Check out the results of our transformations in the `DataFrame` below. 

In [38]:
small_df.sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1,empty_tweet,num_hashtags,followers_following_ratio
19316,2513294525,KADEHUMBER,"Кстати Слуцкий идет по дороге Тарасова, тот то...",Azerbaijan,Russian,8/7/2015 15:42,8/7/2015 15:42,100,146,1695,...,1,2513294525,629678973434335232,http://twitter.com/KadeHumber/statuses/6296789...,,,,False,1,1.46
178560,1682444790,PEYTONCASHOUT,Cops can't be gentle even at home � https://t....,United States,English,4/25/2016 16:21,4/25/2016 16:21,371,297,867,...,0,1682444790,724634465310384130,http://twitter.com/PeytonCashOut/statuses/7246...,https://twitter.com/KeeganNYC/status/724330585...,,,False,0,0.800539
50103,701401000000000000,NOVOSTI_SOCHI,Состояние отрасли чаеводства в Краснодарском к...,Unknown,Russian,5/24/2016 12:57,5/24/2016 12:57,100,18,1725,...,1,701400844181360640,735092282743881728,http://twitter.com/novosti_sochi/statuses/7350...,http://bit.ly/1TwNqBr,,,False,0,0.18
188102,789000000000000000,WORLDNEWSPOLI,Ethiopia's star singer Teddy Afro makes plea f...,United States,English,5/13/2017 10:29,5/13/2017 10:29,4259,3013,28574,...,0,789266125485998080,863340370880454657,http://twitter.com/789266125485998080/statuses...,http://www.washingtontimes.com/news/2017/may/1...,,,False,0,0.707443
78109,2732675512,BGARNER2107,"#ReasonsIAintInARelationship,the people i want...",United States,English,1/23/2017 13:13,1/23/2017 13:15,3756,4444,5593,...,0,2732675512,823518995496243202,http://twitter.com/2732675512/statuses/8235189...,,,,False,1,1.183174
220718,898000000000000000,CAMELIISRT,Traitor Paul Ryan Wants Congress to Legalize E...,United States,English,9/2/2017 16:51,9/2/2017 16:51,3915,1667,388,...,0,898394737618501632,904023982734807044,http://twitter.com/898394737618501632/statuses...,http://zpr.io/PQ6uF,,,False,0,0.425798
47100,1655194147,MELANYMELANIN,'@smilinisle @Twiggy164 @smaddoxsr @BlackjediN...,United States,English,3/18/2017 3:03,3/18/2017 3:03,895,941,4943,...,0,1655194147,842934452074561537,http://twitter.com/1655194147/statuses/8429344...,,,,False,0,1.051397
22954,901000000000000000,POL_WARRIOR,Buckingham Palace in London on lockdown as pol...,United States,English,8/25/2017 22:47,8/25/2017 22:52,15,0,5,...,0,901182103806763008,901214508525527043,http://twitter.com/901182103806763008/statuses...,https://twitter.com/pol_warrior/status/9012145...,,,False,0,0.0
101661,3243189690,MONEYFORM,Rt for a picture of Obama and Biden @turner_is...,United States,English,6/21/2015 4:33,6/21/2015 4:33,0,14,5690,...,1,3243189690,612478391669751808,http://twitter.com/MoneyForm/statuses/61247839...,https://twitter.com/safety/unsafe_link_warning...,,,False,0,inf
152026,1930747698,NEVNOV_RU,Смольный откладывает захоронение останков дете...,United States,Russian,10/7/2015 9:30,10/7/2015 9:30,115,13035,19926,...,0,1930747698,651690947701571584,http://twitter.com/nevnov_ru/statuses/65169094...,https://twitter.com/nevnov_ru/status/651690947...,http://nevnov.ru/city/region/smolnyj-otkladyva...,,False,0,113.347826
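
Notice the `inf` and `0.0` values in the ratio column above. Dividing by a `following` of zero does not raise an error in pandas; it quietly produces `inf` (or `NaN` for 0/0). Here is a minimal sketch on made-up numbers showing one possible way to handle this, by replacing infinite ratios with missing values. Whether that is the right choice depends on your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with zeros in both columns
toy = pd.DataFrame({
    'followers': [14, 0, 300, 0],
    'following': [0, 15, 100, 0],
})

# 14/0 gives inf, 0/15 gives 0.0, 300/100 gives 3.0, 0/0 gives NaN
ratio = toy['followers'] / toy['following']

# One option (an assumption, not the notebook's method): treat inf as missing
cleaned = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio.tolist())
print(cleaned.isna().sum())
```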


### <i class="fa fa-graduation-cap"></i> Avoiding Slow Pandas [<i class="fa fa-forward"></i>](#skip_slow)

From [pandas](https://pandas.pydata.org/):
>pandas is a **fast**, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 

This is all true, with a pretty large caveat. Pandas is fast (and generally efficient), if you avoid some of the common pitfalls. Unfortunately, these traps are easy to fall for and many pandas users (even senior data scientists) don't know they might be slowing their code down 10-1000x. These people will often be hesitant to use pandas on large datasets and may dissuade others from using the library. 

However, by understanding a little about what is going on in the backend, we can avoid the worst of the problems and write relatively fast pandas code. 

#### How are DataFrames stored? 

DataFrames are really just a collection of `Series`, with each column corresponding to its own `Series`. In a `Series`, each item is stored one after the other in memory. This means that the entire column is stored within a single range of memory.

However, the multiple Series (columns) that make a DataFrame can be stored anywhere in memory and are often not stored side-by-side. 

We can think of this like a grocery list for sandwiches. Let's imagine that each kind of sandwich we make is composed of 1 type of bread, 1 type of meat and 1 type of vegetable. We could arrange our grocery list into a table like this: 

| sandwich_id | bread_type | meat_type  | vegetable_type |
|-------------|------------|------------|----------------|
| 0           | sourdough  | ham        | lettuce        |
| 1           | baguette   | turkey     | tomato         |
| 2           | rye        | roast beef | onion          |

We buy all of our bread products from a bakery, meat from a deli, and vegetables from a grocer. The result is that to get everything in a column, you can go to one location (e.g. bakery for bread_type). But to get everything from a row you will have to visit all three locations. 

This means it is really fast to access an entire column, but really slow to access an entire row. Let's check it out!

In [39]:
print('Column\n------')
%timeit col = small_df['following']

print('Row\n------')
%timeit row = small_df.iloc[12]

Column
------
2.31 µs ± 34.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Row
------
146 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In this case, there was more than a 60x difference in speed! 
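
One way to glimpse this layout from inside pandas is to compare the `dtype` of a column with the `dtype` of a row. The sketch below uses a made-up version of the sandwich table (the `slices` and `price` columns are hypothetical additions). A column holds values of one type, while a row has to be assembled from every column into a new `Series`, which falls back to the generic `object` dtype when the values have different types.

```python
import pandas as pd

# Hypothetical sandwich table with columns of different types
sandwiches = pd.DataFrame({
    'bread_type': ['sourdough', 'baguette', 'rye'],
    'slices': [2, 1, 2],
    'price': [6.5, 7.0, 8.25],
})

column = sandwiches['slices']   # one homogeneous column
row = sandwiches.iloc[0]        # a new Series built from all three columns

print(column.dtype)
print(row.dtype)
print(row.tolist())
```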

Understanding how `DataFrames` are stored in memory can help us understand why different approaches to `DataFrame` transformations vary so much in speed. Ideally, we can use this information to write efficient pandas code. 

#### For Loops & `iterrows`
Perhaps one of the most obvious ways to approach a transformation is to go row-by-row through the dataframe, doing the necessary transformations one at a time. A simple way to do this is using a `for` loop.



In [43]:
%%timeit
diff_followers = []
for i in small_df.index:
    row = small_df.loc[i]
    diff = row['followers'] - row['following']
    diff_followers.append(diff)

A second method we can use to add our new column is using the `iterrows()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows)). This is a built-in pandas method, which has been implemented to iterate over the rows in a  DataFrame. 

This method creates a [generator object](https://wiki.python.org/moin/Generators), a special Python object, which we can use a for loop to iterate over. 
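
To see what `iterrows()` hands back on each step, here is a minimal sketch on hypothetical `toy` data. Each item is a pair: the row's index label and a `Series` holding that row's values.

```python
import pandas as pd

# Hypothetical toy data with string index labels
toy = pd.DataFrame(
    {'followers': [10, 2500], 'following': [5, 100]},
    index=['a', 'b'],
)

rows = toy.iterrows()

# Pull the first (index label, row) pair by hand
idx, row = next(rows)

print(idx)
print(type(row).__name__)
print(row['followers'] - row['following'])
```

Nothing is computed until we ask for the next item, which is why we need a `for` loop (or `next()`) to get at the rows.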

In [44]:
%%timeit
diff_followers = []
for idx, row in small_df.iterrows():
    diff = row['followers'] - row['following']
    diff_followers.append(diff)

While this is definitely faster than the basic `for` loop approach, I'll tell you now that it is still really slow! In fact, the underlying reason why both of these approaches are so slow is the same.

Both approaches use a `for` loop to go row-by-row through the DataFrame, gathering the data for each _row_ as it's needed.   

In our sandwich example, this is the equivalent of buying ingredients for sandwich 1, then buying ingredients for sandwich 2, etc. This results in visiting each shop (bakery, deli, grocer) once for every sandwich recipe!

The same thing is happening in pandas. To iterate over the rows using a `for` loop we retrieve all values for row 1, then all values for row 2, etc. 

This is incredibly inefficient (imagine the funny looks you'd get on your 3rd visit to the bakery)! In fact, I would venture to say that **you should never use for loops when working with pandas `DataFrames`**. There might be cases when I'm wrong, but there is almost always a better approach than `for` loops. 

#### The `apply()` Method
A third approach we can use is the `apply()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply)). This built-in pandas method applies a specific function across some axis &mdash; either rows or columns (More on this [later](#axes)). In our case, we want to apply a function along the column axis, applying the function to each row. 

To use `apply()`, you have to define the function you want to apply to each row. This function needs to take in a row, perform some computation, and return a value. For our case, we'll define a `difference()` function. 

In [46]:
%%timeit

def difference(row):
    return row['followers'] - row['following']

diff_followers = small_df.apply(difference, axis=1)

38.2 ms ± 533 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


While this is significantly faster than the previous two approaches, it is still relatively slow because it continues to go row-by-row through the `DataFrame`. 

Since `apply()` is used for a specific purpose, pandas is able to make assumptions and include optimizations that general approaches (e.g. `for` loops) don't have access to.

For example, `apply()` checks to see if your function is compatible with its "fast" mode ([docs](https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/frame.py#L6737-L6928)). As well, it offloads some of the work to C (a low-level language known for speed), only performing the functions themselves in Python. 

**I almost always avoid using `apply()`**, although it does make for readable code.
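
Since the `axis` argument is easy to mix up, here is a minimal sketch on hypothetical `toy` data showing both directions. With `axis=1` the function receives one row at a time; with `axis=0` it receives one column at a time.

```python
import pandas as pd

# Hypothetical toy data standing in for the course dataset
toy = pd.DataFrame({
    'followers': [10, 2500, 40],
    'following': [5, 100, 80],
})

def difference(row):
    # Called once per row when axis=1
    return row['followers'] - row['following']

def largest(col):
    # Called once per column when axis=0
    return col.max()

per_row = toy.apply(difference, axis=1)   # one result per row
per_col = toy.apply(largest, axis=0)      # one result per column

print(per_row.tolist())
print(per_col.to_dict())
```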

#### `zip()` and Iterate
A fourth approach is to use Python's built-in `zip()` function ([docs](https://docs.python.org/3.3/library/functions.html#zip)). This function takes in a group of iterables (lists, dictionaries, tuples, etc.) and creates a new iterator where the i-th element is a tuple containing the i-th elements from each of the original iterables. 

In [47]:
l1 = [1, 2, 3]
l2 = ['a', 'b', 'c']
z = zip(l1, l2)
list(z)

[(1, 'a'), (2, 'b'), (3, 'c')]

This function is useful for many different purposes. For our case, we will:
1. Select columns needed for transformation
2. Zip these columns together
3. Iterate over the zipped object to retrieve pairs one at a time, applying a function to the pairs and storing the result in a list which will later become our new column

In [48]:
%%timeit
diff_followers = []
for followers, following in zip(small_df['followers'], small_df['following']):
    diff = followers - following
    diff_followers.append(diff)

299 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


This method offers a great improvement over our last method (~130x faster). This is the first method that avoids going row-by-row through the `DataFrame`.

Instead of performing many costly read operations (#rows x #columns), this method reads each column only once. The resulting data is stored temporarily in fast memory, where it can be accessed at little cost when it is needed for calculations.

**This is the method I typically use for complicated transformations that involve non-standard operations.**
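
As a sketch of what such a "non-standard" transformation might look like, the example below applies a made-up labelling rule to pairs of values and stores the result as a new column. The `toy` data, the `label()` function, and its rule are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the course dataset
toy = pd.DataFrame({
    'followers': [10, 2500, 40],
    'following': [5, 100, 0],
})

def label(followers, following):
    # Hypothetical rule with a special case that is awkward to vectorize
    if following == 0:
        return 'follows nobody'
    return 'popular' if followers > following else 'not popular'

# Zip the two columns, apply the function to each pair, collect into a list
labels = [label(fo, fg) for fo, fg in zip(toy['followers'], toy['following'])]

# A list with one item per row can be assigned directly as a new column
toy['label'] = labels

print(toy['label'].tolist())
```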

#### Vectorized Functions
Depending on the transformation, we may be able to use a vectorized function. Vectorized functions (aka vector functions) operate on entire `Series` rather than on individual values. 

There are many built-in vectorized functions, such as `-` (shown below), `add()`, `between()`, and `shift()`. You can also build your own vectorized function as a combination of these built-in methods.  

In [49]:
%%timeit
diff_followers = small_df['followers'] - small_df['following']

178 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


As with the `zip()` approach, vectorized functions avoid costly row-by-row reads. Vectorized functions also take advantage of pre-compiled code written in lower-level (and faster) languages such as C.
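
Here is a minimal sketch of the other built-in vectorized methods mentioned above, using a small made-up `Series`. The last line combines two of them into a new vectorized operation (the change from one value to the next).

```python
import pandas as pd

# Hypothetical values
s = pd.Series([5, 20, 35, 50])

plus_ten = s.add(10)           # same as s + 10
in_range = s.between(10, 40)   # True where 10 <= value <= 40
previous = s.shift(1)          # each value moved down one position

# Combining built-ins: difference between each value and the one before it
change = s - s.shift(1)

print(plus_ten.tolist())
print(in_range.tolist())
print(change.tolist())
```

The first entry of `change` is missing because the first value has nothing before it.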

We can go one step further and convert pandas `Series` into NumPy arrays, applying the same vectorized functions to obtain our transformation. 

In [50]:
%%timeit
diff_followers = np.array(small_df['followers']) - np.array(small_df['following'])

46 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


By converting the pandas `Series` to NumPy arrays, this method removes the overhead incurred by pandas' additional functionality.   
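
In practice we usually want the result back in the `DataFrame`. The sketch below, on hypothetical `toy` data, does the subtraction on NumPy arrays and assigns the result as a new column. It uses `Series.to_numpy()` as an alternative to `np.array()` for getting the arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the course dataset
toy = pd.DataFrame({
    'followers': [10, 2500, 40],
    'following': [5, 100, 80],
})

# Pull the underlying arrays out of the Series
followers = toy['followers'].to_numpy()
following = toy['following'].to_numpy()

# Plain NumPy subtraction, with no index alignment or other pandas bookkeeping
diff = followers - following

# An array with one value per row can be assigned back as a column
toy['diff_followers'] = diff

print(type(diff).__name__)
print(toy['diff_followers'].tolist())
```

One thing we give up here is the index: NumPy matches values purely by position, so both arrays need to come from the same `DataFrame` in the same row order.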

#### and Beyond
For cases when even these options aren't fast enough, you can implement more advanced techniques to enhance performance. The improvements offered by these advanced techniques differ based on the problem. For example, some techniques use functions and methods optimized for boolean comparisons (e.g. greater than) but offer little improvement when working with other operations like addition. 

Some approaches to check out include: 
* Using [NumExpr](https://pypi.org/project/numexpr/2.6.1/) for extra fast numerical expressions
* Rewriting functions in [Cython](https://cython.org/)
* Using [Numba](https://numba.pydata.org/) to convert Python code to fast machine code. 

#### Key Points
While differences in speed are hard (if not impossible) to notice for small datasets, they can become hugely consequential when working with large datasets or performing complex calculations. 

We have to remember that optimizing code should not come at the expense of functionality. Often it's best to get something that works before going back and finding the optimal solution. However, I hope that by introducing a couple of "Do's & Don'ts" your first instincts can help you avoid some of the easiest traps.

1. Never directly iterate over the rows in a DataFrame. Avoid anything that goes row-by-row.  
2. Working with NumPy arrays will generally be faster than working with pandas Series.
3. DataFrame data is stored based on columns, not rows. This means it's much faster to access a column than a row. 

<a id='skip_slow'></a>
# Aggregation and Grouped Operations

Some of the most common tasks in any given data analysis project involve some sort of aggregation or grouped operation. For example, we might want to compute and compare descriptive statistics for observations that take different values on a categorical variable. Let's see how to do that, and other grouped operations, with `pandas`. 

In brief, the `groupby()` method splits the `dataframe` into groups based on the values of a given variable. We can then perform operations on the resulting groups, such as computing descriptive statistics. 

In [28]:
grouped = df.groupby('account_category')
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

The code above returns a grouped object that we can work with. Let's say we want to pull out a specific group. We can use the `get_group()` method to pull a group from the grouped object. (Note that the `.get_group()` code below is equivalent to `df[df['account_category'] == 'RightTroll']`.) 

In [29]:
right_troll_group = grouped.get_group('RightTroll')
right_troll_group.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,2535818742,HAPPKENDRAHAPPY,Bosh situation hasn't changed since Feb. Heat ...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2285,...,Right,1,RightTroll,0,2535818742,779366817752084481,http://twitter.com/happkendrahappy/statuses/77...,,,
1,2535818742,HAPPKENDRAHAPPY,Youre an IDIOT! Now you say 99% when before yo...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2284,...,Right,1,RightTroll,0,2535818742,779366807446642688,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/erecordscity/status/779354...,,
2,2535818742,HAPPKENDRAHAPPY,Charlotte-Mecklenburg Fraternal Order of Polic...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2286,...,Right,1,RightTroll,0,2535818742,779366828585910272,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
3,2535818742,HAPPKENDRAHAPPY,Theodore Roosevelt's son Quentin and his frien...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2290,...,Right,1,RightTroll,0,2535818742,779367073998835712,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/HistoryInPics/status/73882...,,
4,2535818742,HAPPKENDRAHAPPY,.@flashfire451: Are there more cures than term...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2289,...,Right,1,RightTroll,0,2535818742,779367063479521281,http://twitter.com/happkendrahappy/statuses/77...,,,
5,2535818742,HAPPKENDRAHAPPY,Suspected Illegal Alien Marijuana Farmers Held...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2288,...,Right,1,RightTroll,0,2535818742,779367053077712896,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
6,2535818742,HAPPKENDRAHAPPY,This picture is 100% BOGUS! Just watch MSNBC &...,United States,English,9/23/2016 17:10,9/23/2016 17:10,1311,1688,2291,...,Right,1,RightTroll,0,2535818742,779367286385831937,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/micheleredding2/status/779...,,
7,2535818742,HAPPKENDRAHAPPY,Proud to be part-Polish! Poland Initially Appr...,United States,English,9/23/2016 17:11,9/23/2016 17:11,1311,1688,2293,...,Right,1,RightTroll,0,2535818742,779367574454800384,http://twitter.com/happkendrahappy/statuses/77...,http://www.lifenews.com/2016/09/23/poland-pois...,,
8,2535818742,HAPPKENDRAHAPPY,in your case Hillary if you do when you're go...,United States,English,9/23/2016 17:11,9/23/2016 17:12,1311,1688,2292,...,Right,1,RightTroll,0,2535818742,779367562782117889,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/cboutet11/status/778718668...,,
9,2535818742,HAPPKENDRAHAPPY,#ThingsMoreTrustedThanHillary Any fairy tale book,United States,English,9/27/2016 1:35,9/27/2016 1:37,1311,1686,2293,...,Right,0,RightTroll,0,2535818742,780581631342080001,http://twitter.com/happkendrahappy/statuses/78...,,,
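
The equivalence between `get_group()` and a boolean filter can be checked on a small example. This is a minimal sketch: the `toy` dataframe and its values are made up for illustration and are not part of the Russian Trolls dataset.

```python
import pandas as pd

# Toy data standing in for the tweets dataframe (values are made up).
toy = pd.DataFrame({
    'account_category': ['RightTroll', 'LeftTroll', 'RightTroll', 'NewsFeed'],
    'followers': [10, 20, 30, 40],
})

toy_grouped = toy.groupby('account_category')

# Two ways of pulling out the same rows.
via_get_group = toy_grouped.get_group('RightTroll')
via_mask = toy[toy['account_category'] == 'RightTroll']

# Both selections keep the same rows, in the same order.
print(list(via_get_group.index) == list(via_mask.index))
print(via_get_group['followers'].equals(via_mask['followers']))
```

The boolean filter is usually enough for a one-off selection. `get_group()` is convenient when the grouped object already exists.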


As previously mentioned, sometimes we want to compute some value for a group within the dataset. We can do this by specifying the grouped object, the `Series` we want to perform an operation on, and finally the operation we want to perform. A full list of operations available when working with `Series` can be found in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

In [30]:
grouped['followers'].median()

account_category
Commercial        273
Fearmonger         48
HashtagGamer     2480
LeftTroll         836
NewsFeed        14722
NonEnglish        503
RightTroll       1437
Unknown           205
Name: followers, dtype: int64

In [31]:
grouped['following'].median()

account_category
Commercial         3
Fearmonger        65
HashtagGamer    2613
LeftTroll        796
NewsFeed        7089
NonEnglish       434
RightTroll      1864
Unknown          567
Name: following, dtype: int64

There are many things you can do here, such as comparing the ratio of followers to following. 

In [32]:
grouped['followers'].median() / grouped['following'].median()

account_category
Commercial      91.000000
Fearmonger       0.738462
HashtagGamer     0.949101
LeftTroll        1.050251
NewsFeed         2.076739
NonEnglish       1.158986
RightTroll       0.770923
Unknown          0.361552
dtype: float64

We can also perform some operations on the grouped object itself, such as computing the number of observations in each group, which in this case is equal to the number of tweets sent by accounts in each category. 

In [33]:
grouped.size().sort_values(ascending=False)

account_category
NonEnglish      820803
RightTroll      711668
NewsFeed        598226
LeftTroll       427141
HashtagGamer    241786
Commercial      121904
Unknown          13539
Fearmonger       11140
dtype: int64

It is also possible to group by multiple variables, such as `account_category` and `language`, and then perform an operation on the groups, such as computing the median number of followers. 

In [34]:
cat_lang = df.groupby(['account_category', 'language'], as_index=False)['followers'].median()
cat_lang.sample(30)

Unnamed: 0,account_category,language,followers
99,HashtagGamer,Hungarian,2745.0
237,NonEnglish,Indonesian,126.5
255,NonEnglish,Slovak,1220.0
187,NewsFeed,Farsi (Persian),13768.0
254,NonEnglish,Simplified Chinese,6865.0
226,NonEnglish,Estonian,160.0
246,NonEnglish,Malay,1523.5
172,LeftTroll,Turkish,769.5
245,NonEnglish,Macedonian,406.0
39,Commercial,Serbian,257.0


Depending on what you are doing, the result of a grouped analysis like this could be a `Series` or a `DataFrame`. 
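
One thing that decides the return type is the `as_index` argument of `groupby()`. The sketch below uses a made-up `toy` dataframe (an assumption for illustration) to show both outcomes for the same median calculation.

```python
import pandas as pd

# Toy data (made-up values) with two grouping variables.
toy = pd.DataFrame({
    'account_category': ['A', 'A', 'B', 'B'],
    'language': ['English', 'English', 'English', 'Russian'],
    'followers': [10, 30, 5, 7],
})

# Default (as_index=True): the group keys become the index and the result is a Series.
as_series = toy.groupby(['account_category', 'language'])['followers'].median()

# as_index=False: the group keys stay as ordinary columns and the result is a DataFrame.
as_frame = toy.groupby(['account_category', 'language'], as_index=False)['followers'].median()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

The `DataFrame` form is often easier to sort, filter, and plot, while the `Series` form is handy for quick lookups by group key.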

Finally, we can perform *multiple* operations on a grouped object by using the `agg()` method ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)). The `agg()` method will apply one or more aggregate functions to a grouped object, returning the results of each. 

To specify which operations `agg()` will apply, a list of functions or string function names is provided. Each function must accept a `Series` or `DataFrame` as input (depending on what type the grouped object is) or work when passed to the `apply()` method. 

We can re-implement the median calculation from above using the `agg()` method.

In [44]:
grouped['followers'].agg([np.median])

Unnamed: 0_level_0,median
account_category,Unnamed: 1_level_1
Commercial,273
Fearmonger,48
HashtagGamer,2480
LeftTroll,836
NewsFeed,14722
NonEnglish,503
RightTroll,1437
Unknown,205


We can specify additional functions by adding to the list provided to `agg()`. Notice that we can use a function or a string function name when specifying which operations to apply.

In [46]:
grouped['followers'].agg([min, np.median, 'max', 'count'])

Unnamed: 0_level_0,min,median,max,count
account_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Commercial,0,273,858,121904
Fearmonger,0,48,120,11140
HashtagGamer,0,2480,24663,241786
LeftTroll,0,836,56725,427141
NewsFeed,0,14722,62088,598226
NonEnglish,-1,503,251276,820803
RightTroll,0,1437,145244,711668
Unknown,0,205,6343,13539


We can even define our own function for `agg()` to use.

In [47]:
def count_greater_than_0(series):
    gt_0 = series[series > 0]
    return len(gt_0)

grouped['followers'].agg([min, np.median, 'max', 'count', count_greater_than_0])

Unnamed: 0_level_0,min,median,max,count,count_greater_than_0
account_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Commercial,0,273,858,121904,121704
Fearmonger,0,48,120,11140,10898
HashtagGamer,0,2480,24663,241786,240713
LeftTroll,0,836,56725,427141,426891
NewsFeed,0,14722,62088,598226,597888
NonEnglish,-1,503,251276,820803,817637
RightTroll,0,1437,145244,711668,703651
Unknown,0,205,6343,13539,12997
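
The same kind of summary can also be written with keyword arguments to `agg()`, which lets us choose the output column names directly. This is a hedged sketch on a made-up `toy` dataframe. The output names (`followers_min`, `n_positive`, and so on) are arbitrary labels chosen for this example.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    'account_category': ['A', 'A', 'A', 'B', 'B'],
    'followers': [0, 10, 30, 5, 7],
})

def count_greater_than_0(series):
    # Count the values in the group that are strictly positive.
    return int((series > 0).sum())

# Each keyword becomes a column name; each value is a function or string function name.
summary = toy.groupby('account_category')['followers'].agg(
    followers_min='min',
    followers_median='median',
    followers_max='max',
    n_positive=count_greater_than_0,
)
print(summary)
```

Naming the outputs this way avoids having to rename columns after the aggregation.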


<a id='axes'></a>
### <i class="fa fa-graduation-cap"></i> Background Knowledge &mdash; Axes [<i class="fa fa-forward"></i>](#skip_axes)

Once again, feel free to skip this section. 

If you took a chance to delve into the documentation for any of the aggregation functions, you may have noticed an optional parameter called `axis`. The description of this parameter usually says something like: 

> **axis : {0 or ‘index’, 1 or ‘columns’}, default 0**   
If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.

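In other words, `axis=0` works along the index (down the rows), so a function is applied to each column and returns one result per column. `axis=1` works along the columns (across each row), so a function is applied to each row and returns one result per row. 

If we are doing tasks related to columns, such as calculating the median value of a column or sorting by a column, we will want `axis=0`, which is usually the default. 

If we are doing tasks related to rows, such as using the `apply()` method to create a new column from the values in each row, we will want to set `axis=1`. Be careful with methods such as `dropna()`, where `axis` names what gets dropped: `axis=0` (the default) drops rows with missing values and `axis=1` drops columns. 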

As a bit of foreshadowing, operating over rows is generally very slow. We will return to this in a later Background Knowledge section.
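
To make the `axis` parameter concrete, here is a minimal sketch on a made-up `toy` dataframe (the column names and values are assumptions for illustration). It shows a column-wise reduction, a row-wise reduction, a row-wise `apply()`, and how `dropna()` interprets `axis`.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one missing entry.
toy = pd.DataFrame({
    'followers': [10.0, 20.0, np.nan],
    'following': [5.0, 40.0, 8.0],
})

# axis=0 (the default): reduce down the rows, giving one value per column.
col_medians = toy.median(axis=0)

# axis=1: reduce across the columns, giving one value per row.
row_max = toy.max(axis=1)

# apply() with axis=1 passes each row to the function; here it builds a new column.
toy['ratio'] = toy.apply(lambda row: row['followers'] / row['following'], axis=1)

# For dropna(), axis names what is dropped:
# axis=0 drops rows containing missing values, axis=1 drops columns containing them.
rows_dropped = toy.dropna(axis=0)
cols_dropped = toy.dropna(axis=1)

print(col_medians)
print(row_max)
print(rows_dropped)
print(cols_dropped)
```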

<a id='skip_axes'></a>
## Sorting and Ranking

Sorting and ranking observations based on some criteria is a common data analysis task. For example, we might want to know which accounts in our dataset have the most followers. 

First, I will create a new `DataFrame` with some extra follower information for each account category & language group. 

In [70]:
new_cat_lang = df.groupby(['account_category', 'language'], as_index=False)['followers'].agg([min, np.median, max]).reset_index()
new_cat_lang.columns = ['account_category', 'language', 'followers_min', 'followers_median', 'followers_max']
new_cat_lang.sample(10)

Unnamed: 0,account_category,language,followers_min,followers_median,followers_max
16,Commercial,Gujarati,191,191.0,191
51,Commercial,Turkish,99,381.0,853
122,HashtagGamer,Tagalog (Filipino),9,2689.0,22638
191,NewsFeed,Hungarian,2,13156.0,16724
152,LeftTroll,Japanese,20,837.5,2427
324,Unknown,Farsi (Persian),0,61.0,92
259,NonEnglish,Swedish,0,697.5,6518
278,RightTroll,Finnish,2,119.0,32819
123,HashtagGamer,Thai,2147,2522.5,3751
17,Commercial,Hebrew,255,299.0,372


To start, we can sort `new_cat_lang` based on the median number of followers. 

In [71]:
new_cat_lang.sort_values('followers_median', ascending=False)[:10]

Unnamed: 0,account_category,language,followers_min,followers_median,followers_max
242,NonEnglish,LANGUAGE UNDEFINED,0,26395.0,251275
178,NewsFeed,Arabic,0,20700.0,33185
213,NewsFeed,Turkish,18603,19685.0,20994
209,NewsFeed,Somali,10,19622.0,31854
188,NewsFeed,Finnish,12288,19035.5,25733
292,RightTroll,LANGUAGE UNDEFINED,0,17448.0,32459
215,NewsFeed,Uzbek,6,17026.0,27637
216,NewsFeed,Vietnamese,1136,16555.0,61661
177,NewsFeed,Albanian,7,16110.0,31813
183,NewsFeed,Danish,12140,15565.0,18792


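Using the same criteria, we can also rank the account groups by their median number of followers. By default, `.rank()` ranks in ascending order, so the group with the lowest median receives a rank of 1; pass `ascending=False` if the group with the highest median should be ranked first. With `method='max'`, tied groups all receive the highest rank in their tie. 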

In [74]:
new_cat_lang['followers_median'].rank(method='max')

0      130.0
1      140.0
2      123.0
3      155.0
4      151.0
       ...  
347     66.0
348     16.0
349     81.0
350     59.0
351     35.0
Name: followers_median, Length: 352, dtype: float64

If we were to save this value as a new column, we could use it later on for filtering, conducting more analyses, or highlighting important accounts in a visualization. More on this later.   
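
A minimal sketch of that idea, using a made-up `toy` dataframe: the column name `median_rank` and the cutoff of 2 are assumptions chosen for illustration. Here `ascending=False` gives the largest median a rank of 1.

```python
import pandas as pd

# Toy data (made-up values); two groups share the highest median.
toy = pd.DataFrame({
    'group': ['a', 'b', 'c', 'd'],
    'followers_median': [0.0, 120.0, 35.5, 120.0],
})

# Rank in descending order so the highest median gets rank 1.
# method='min' gives tied groups the lowest rank in their tie.
toy['median_rank'] = toy['followers_median'].rank(ascending=False, method='min')

# Use the saved rank to filter down to the top-ranked groups.
top = toy[toy['median_rank'] <= 2]
print(top)
```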

### Breaking Ties
Consider the sort below. 

In [75]:
new_cat_lang.sort_values('followers_median', ascending=True)[:10]

Unnamed: 0,account_category,language,followers_min,followers_median,followers_max
155,LeftTroll,LANGUAGE UNDEFINED,0,0.0,622
345,Unknown,Swedish,0,0.0,0
282,RightTroll,Gujarati,0,0.0,11
71,Fearmonger,LANGUAGE UNDEFINED,0,0.0,0
268,RightTroll,Bengali,1,1.0,1
269,RightTroll,Bulgarian,1,1.0,1
327,Unknown,German,0,7.0,220
214,NewsFeed,Urdu,10,10.0,10
340,Unknown,Russian,0,12.0,6343
249,NonEnglish,Portuguese,0,18.0,4348


As you can see, there are multiple account groups with a median of 0 followers. In these cases it might be useful to break the tie using another column. 

To do this, we can specify multiple columns to be used during sorting. The importance of the columns in the sort is determined by the order in which they are provided. For example, in the cell below the `followers_median` column will be used to sort the data first, then the `followers_max` column will only be used to break ties in the original sort.  

In [76]:
new_cat_lang.sort_values(['followers_median', 'followers_max'], 
                         ascending=True)[:10]

Unnamed: 0,account_category,language,followers_min,followers_median,followers_max
71,Fearmonger,LANGUAGE UNDEFINED,0,0.0,0
345,Unknown,Swedish,0,0.0,0
282,RightTroll,Gujarati,0,0.0,11
155,LeftTroll,LANGUAGE UNDEFINED,0,0.0,622
268,RightTroll,Bengali,1,1.0,1
269,RightTroll,Bulgarian,1,1.0,1
327,Unknown,German,0,7.0,220
214,NewsFeed,Urdu,10,10.0,10
340,Unknown,Russian,0,12.0,6343
249,NonEnglish,Portuguese,0,18.0,4348


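Ties in a ranking are handled by the `method` parameter of `.rank()`. The default, `'average'`, gives tied values the mean of the ranks they span; `'min'` and `'max'` give them the lowest or highest of those ranks; `'first'` ranks them in the order they appear in the data; and `'dense'` works like `'min'` but leaves no gaps between ranks. To break ties using another column, we can sort by that column first and then rank with `method='first'`.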
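
The toy example below (made-up values) compares the tie-handling options of the `method` parameter of `.rank()` side by side. It is only a sketch for building intuition; the `medians` series is hypothetical.

```python
import pandas as pd

# Toy data (made-up values); 'w' and 'x' are tied at 0.
medians = pd.Series([0.0, 0.0, 1.0, 7.0], index=['w', 'x', 'y', 'z'])

# One column per tie-handling method, so the differences are easy to compare.
comparison = pd.DataFrame({
    method: medians.rank(method=method)
    for method in ['average', 'min', 'max', 'first', 'dense']
})
print(comparison)
```

Only the rows for the tied values `w` and `x` differ across most columns. `'dense'` also changes the later ranks because it leaves no gaps.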

## Correlation and Covariance 
Sometimes it is useful to calculate the correlation between numeric columns in a `DataFrame`. To do this you can use the `.corr()` method, which by default calculates the Pearson correlation coefficient, ignoring any missing (NA or null) values. 

$$\rho_{x,y} = \frac{\text{cov}(x,y)}{\sigma_x \sigma_y}$$

In [77]:
df.corr()

Unnamed: 0,following,followers,updates,retweet,new_june_2018,tweet_id
following,1.0,0.580259,0.15195,-0.305094,-0.150726,0.110589
followers,0.580259,1.0,0.233705,-0.312036,-0.049159,0.086571
updates,0.15195,0.233705,1.0,-0.17192,0.119216,0.14943
retweet,-0.305094,-0.312036,-0.17192,1.0,0.116437,-0.027388
new_june_2018,-0.150726,-0.049159,0.119216,0.116437,1.0,-0.353891
tweet_id,0.110589,0.086571,0.14943,-0.027388,-0.353891,1.0


Additionally, if you are interested in finding the correlation between two specific columns you can use the `.corr()` method for `Series`. 

In [51]:
df['followers'].corr(df['following'])

0.5802587806116586

Look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) for the `corr()` method. In the cell below use the `.corr()` method to find the Spearman rank correlation between any two columns in the `DataFrame`. 

In [None]:
# Your Answer Here

Additionally, we can calculate the pairwise covariance between columns using the `.cov()` method.

$$\text{cov}(x,y) = \frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{N-1}$$

Once again, this can be applied to an entire `DataFrame`.

In [52]:
df.cov()

Unnamed: 0,following,followers,updates,retweet,new_june_2018,tweet_id
following,31647870.0,47776530.0,15135000.0,-852.0629,-345.109,6.002876e+19
followers,47776530.0,214210700.0,60561680.0,-2267.201,-292.8336,1.222554e+20
updates,15135000.0,60561680.0,313486300.0,-1511.126,859.0936,2.552837e+20
retweet,-852.0629,-2267.201,-1511.126,0.2464509,0.02352615,-1311904000000000.0
new_june_2018,-345.109,-292.8336,859.0936,0.02352615,0.16565,-1.389762e+16
tweet_id,6.002876e+19,1.222554e+20,2.552837e+20,-1311904000000000.0,-1.389762e+16,9.310034e+33


Or to individual columns. 

In [54]:
df['followers'].cov(df['following'])

47776527.55856113
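
The covariance and correlation formulas above can be checked against each other numerically. This is a minimal sketch on a made-up `toy` dataframe; it relies on the fact that `.cov()` and `.std()` both use the sample (N-1) denominator by default.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    'followers': [10, 20, 30, 45, 60],
    'following': [12, 18, 33, 40, 70],
})
x, y = toy['followers'], toy['following']

# Covariance computed directly from the formula, with the N-1 denominator.
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(toy) - 1)
cov_pandas = x.cov(y)

# Pearson correlation: covariance divided by the product of the standard deviations.
rho_manual = cov_pandas / (x.std() * y.std())
rho_pandas = x.corr(y)

print(abs(cov_manual - cov_pandas) < 1e-9)
print(abs(rho_manual - rho_pandas) < 1e-9)
```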

# Dates and Times
Many real world datasets include a temporal component, including the Russian Trolls dataset. Often, strings are used to store dates and times. However, strings don't take advantage of the unique properties of time. 

For example, it becomes difficult to sort dates if they are stored in strings with strange formats. This is because strings are sorted alphabetically, rather than based on what the string actually represents.

In [116]:
"Monday Mar 2, 1999" > "Friday Feb 21, 2020"

True

Additionally, it is often tedious to extract features of the date string such as day of the week, month, or timezone. 

This is why pandas and Python have implemented special types for date/time objects, called [`Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html) and [`datetime`](https://docs.python.org/3/library/datetime.html), respectively. `Timestamp` is the pandas counterpart of Python's `datetime`, and the two are interchangeable in most cases.   

We can convert date strings from a column or `Series` into Timestamps using the `to_datetime` function. 

In [150]:
small_df['dt_publish_date'] = pd.to_datetime(small_df['publish_date'])
small_df['dt_harvested_date'] = pd.to_datetime(small_df['harvested_date'])

small_df.sample(5)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1,empty_tweet,num_hashtags,followers_following_ratio,dt_publish_date,dt_harvested_date
8213,2601235821,TODAYPITTSBURGH,Wind Chill Advisory Issued For Monday Night #...,United States,English,1/18/2016 20:52,1/18/2016 20:52,8465,15195,22550,...,689188617898467328,http://twitter.com/TodayPittsburgh/statuses/68...,,,,False,1,1.795038,2016-01-18 20:52:00,2016-01-18 20:52:00
53576,1443766015,NYAN_MEOW_MEOW,"'@danilanonstop2 Данила Даживу, ниче так, меня...",United Arab Emirates,Russian,7/20/2015 22:01,7/20/2015 22:04,177,2948,54462,...,623251382917824512,http://twitter.com/nyan_meow_meow/statuses/623...,,,,False,0,16.655367,2015-07-20 22:01:00,2015-07-20 22:04:00
241064,3717196514,NOVOSTIKLNGRD,Kaliningrad Street Food поделился планами на г...,Unknown,Russian,1/29/2016 11:41,1/29/2016 11:41,295,45,2023,...,693036169043378176,http://twitter.com/NovostiKlngrd/statuses/6930...,http://bit.ly/20aLx61,,,False,0,0.152542,2016-01-29 11:41:00,2016-01-29 11:41:00
19700,3805763416,NOVOSTIPENZA,«Мегафон» подарил детсаду книги для детей с на...,Unknown,Russian,10/10/2016 7:44,10/10/2016 7:44,288,50,4750,...,785385397245227008,http://twitter.com/NovostiPenza/statuses/78538...,http://bit.ly/2eidOW3,,,False,0,0.173611,2016-10-10 07:44:00,2016-10-10 07:44:00
240880,2752677905,TODAYNYCITY,Pedestrian struck and killed by a car in the B...,United States,English,10/11/2016 2:51,10/11/2016 2:51,6007,60975,42157,...,785674039805173760,http://twitter.com/TodayNYCity/statuses/785674...,http://nydn.us/2dLGxRv,,,False,0,10.150658,2016-10-11 02:51:00,2016-10-11 02:51:00
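
When the date format is known, it can be passed to `to_datetime` explicitly. The sketch below uses a made-up `raw` series whose strings follow the month/day/year pattern seen in `publish_date`; the third value is deliberately invalid. Setting `errors='coerce'` turns unparseable strings into `NaT` (the datetime counterpart of a missing value) instead of raising an error.

```python
import pandas as pd

# Made-up date strings in the same style as publish_date, plus one bad value.
raw = pd.Series(['9/23/2016 17:08', '12/1/2017 10:33', 'not a date'])

# An explicit format avoids guessing; errors='coerce' maps bad values to NaT.
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M', errors='coerce')
print(parsed)

# Sorting the strings puts the 2017 date first; sorting the Timestamps is chronological.
print(raw.iloc[:2].sort_values().tolist())
print(parsed.iloc[:2].sort_values().tolist())
```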


Let's check the type associated with the `dt_publish_date` column. 

In [134]:
small_df['dt_publish_date']

96879    2017-04-21 10:08:00
207838   2016-07-25 10:22:00
185630   2017-01-12 08:39:00
36216    2015-07-28 18:39:00
206471   2017-01-10 17:38:00
                 ...        
167065   2017-09-14 00:08:00
67538    2015-11-30 22:18:00
54294    2016-12-23 22:05:00
159433   2017-02-16 01:14:00
111010   2017-04-05 03:05:00
Name: dt_publish_date, Length: 1000, dtype: datetime64[ns]

Now that the column is stored in a datetime-specific format, we can access time-specific attributes such as the month. 

In [145]:
small_df['dt_publish_date'].dt.month

96879      4
207838     7
185630     1
36216      7
206471     1
          ..
167065     9
67538     11
54294     12
159433     2
111010     4
Name: dt_publish_date, Length: 1000, dtype: int64

We can also sort a `DataFrame` by publish date. 

In [147]:
small_df.sort_values(['dt_publish_date'])

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1,empty_tweet,num_hashtags,followers_following_ratio,dt_publish_date
16435,2572092279,HIPPPO_,"When life gives you 100 reasons to cry, show l...",United States,English,11/27/2014 15:04,11/27/2014 15:05,153,3,74,...,2572092279,537985212676710400,http://twitter.com/hipppo_/statuses/5379852126...,,,,False,2,0.019608,2014-11-27 15:04:00
108357,2503202888,KRISTYANANN,A week is a long time in politics.,United States,English,12/1/2014 10:33,12/1/2014 10:33,270,69,785,...,2503202888,539366644779200512,http://twitter.com/KristyanaNN/statuses/539366...,,,,False,0,0.255556,2014-12-01 10:33:00
141540,2533221819,LAZYKSTAFFORD,A relationship is not based on the length of t...,United States,English,12/23/2014 14:32,12/23/2014 14:32,594,51,62,...,2533221819,547399263706415104,http://twitter.com/LazyKStafford/statuses/5473...,,,,False,0,0.085859,2014-12-23 14:32:00
214074,2496599688,NOTRITAHART,'@Sophie_kole it's stunning http://t.co/evTqfg...,United States,Icelandic,1/20/2015 7:46,1/20/2015 7:46,360,114,396,...,2496599688,557443940865949696,http://twitter.com/NotRitaHart/statuses/557443...,http://vimeo.com/97585553,,,False,0,0.316667,2015-01-20 07:46:00
71881,2951556370,SPECIALAFFAIR,#WorldNews U.S. deploys search and rescue heli...,United States,English,2/5/2015 3:34,2/5/2015 3:34,622,53,1024,...,2951556370,563178737584189440,http://twitter.com/SpecialAffair/statuses/5631...,http://bit.ly/1KfhEWz,,,False,1,0.085209,2015-02-05 03:34:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174868,743167000000000000,COVFEFENATIONUS,'@JaydaBF https://t.co/Ij0ut5HXDL',United States,English,11/29/2017 17:43,11/29/2017 17:43,245,2536,156216,...,743166519157227520,935927123428048896,http://twitter.com/743166519157227520/statuses...,https://twitter.com/CovfefeNationUS/status/935...,,,False,0,10.351020,2017-11-29 17:43:00
177809,743167000000000000,COVFEFENATIONUS,".@TuckerCarlson: ""Democrats have made no meani...",United States,English,12/1/2017 22:47,12/1/2017 22:47,246,2530,158068,...,743166519157227520,936728389536190465,http://twitter.com/743166519157227520/statuses...,https://twitter.com/FoxNews/status/93672705396...,,,False,1,10.284553,2017-12-01 22:47:00
181876,743167000000000000,COVFEFENATIONUS,Scoop coming. Why I think it’s MCCABE WHO DIDD...,United States,English,12/5/2017 4:06,12/5/2017 4:06,252,2583,161414,...,743166519157227520,937895867985510400,http://twitter.com/743166519157227520/statuses...,https://twitter.com/thomas1774paine/status/937...,,,False,0,10.250000,2017-12-05 04:06:00
26588,912394000000000000,BARBARAFORTRUMP,Belief. Confidence. Reliance. We support you M...,United States,English,12/15/2017 2:37,12/15/2017 2:37,2000,861,223,...,912393907069059072,941497362601598976,http://twitter.com/912393907069059072/statuses...,https://twitter.com/BarbaraForTrump/status/941...,,,False,2,0.430500,2017-12-15 02:37:00


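We can also add or subtract datetime columns to create new columns. Subtracting one datetime column from another produces a column of time differences (`Timedelta` values).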

In [159]:
small_df['days_until_harvest'] = small_df['dt_harvested_date'] - small_df['dt_publish_date']
small_df.sample(5)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,tco1_step1,tco2_step1,tco3_step1,empty_tweet,num_hashtags,followers_following_ratio,dt_publish_date,dt_harvested_date,days_online,days_until_harvest
223185,1513801268,YOUJUSTCTRLC,I remember Donnie Simpson making fun of Brian ...,United States,English,1/26/2017 5:57,1/26/2017 5:58,2656,2717,3251,...,,,,False,0,1.022967,2017-01-26 05:57:00,2017-01-26 05:58:00,00:01:00,00:01:00
127453,1658202894,LAURABAELEY,"""#TrumpBecause Just 1 more thing 4 Trump to do...",United States,English,8/13/2015 16:28,8/13/2015 16:28,284,313,1094,...,,,,False,1,1.102113,2015-08-13 16:28:00,2015-08-13 16:28:00,00:00:00,00:00:00
256011,2611151319,SEATTLE_POST,CEO says ad-free CBS All Access for $10 is ‘ve...,United States,English,11/4/2015 15:30,11/4/2015 15:30,3540,11506,13395,...,http://bit.ly/1RTeP0k,,,False,1,3.250282,2015-11-04 15:30:00,2015-11-04 15:30:00,00:00:00,00:00:00
145287,2573225349,DOMINICVALENT,#ImStillLookingFor my Life Alert so I can get ...,United States,English,1/17/2016 13:47,1/17/2016 13:47,1330,1778,1381,...,,,,False,1,1.336842,2016-01-17 13:47:00,2016-01-17 13:47:00,00:00:00,00:00:00
126199,1687183549,BLEEPTHEPOLICE,Those cops where gang banging that night and w...,United States,English,4/30/2016 21:49,4/30/2016 21:49,6502,7903,9479,...,https://twitter.com/BleepThePolice/status/7265...,,,False,1,1.215472,2016-04-30 21:49:00,2016-04-30 21:49:00,00:00:00,00:00:00
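
A difference between two datetime columns is a `Timedelta`, not a plain number. If a numeric value is needed, for example for plotting or filtering, it can be converted through the `.dt` accessor. This is a sketch with made-up timestamps, and the column names `delay`, `delay_minutes`, and `delay_days` are assumptions for illustration.

```python
import pandas as pd

# Toy data (made-up timestamps).
toy = pd.DataFrame({
    'published': pd.to_datetime(['2017-01-26 05:57', '2016-04-30 21:49']),
    'harvested': pd.to_datetime(['2017-01-26 05:58', '2016-05-02 09:49']),
})

# Subtracting datetime columns gives a Timedelta column.
toy['delay'] = toy['harvested'] - toy['published']

# Convert the Timedelta to plain numbers.
toy['delay_minutes'] = toy['delay'].dt.total_seconds() / 60  # total length in minutes
toy['delay_days'] = toy['delay'].dt.days                     # whole days only
print(toy)
```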


In the cell below use methods found [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) to create a new column called `weekend`. It should be `True` if the tweet was published on a Saturday or Sunday & `False` otherwise. 

In [None]:
# Your Answer Here

# Missing Data
Missing data is a common occurrence when working with real-world datasets. Data can be missing for multiple reasons. What are some of the reasons you are familiar with? Try to think of at least 3.

In [80]:
# Your answer here
#
#
#

Once you recognize that any `Series` or `DataFrame` corresponding to a real-world dataset is likely to have missing values, you're probably wondering how these missing data are stored in pandas. 

Generally, pandas uses the `np.nan` value to represent missing data. See the table below for some examples of rows containing missing data (scroll to the far-right columns). 

In [67]:
df.head(5)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,2535818742,HAPPKENDRAHAPPY,Bosh situation hasn't changed since Feb. Heat ...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2285,...,Right,1,RightTroll,0,2535818742,779366817752084481,http://twitter.com/happkendrahappy/statuses/77...,,,
1,2535818742,HAPPKENDRAHAPPY,Youre an IDIOT! Now you say 99% when before yo...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2284,...,Right,1,RightTroll,0,2535818742,779366807446642688,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/erecordscity/status/779354...,,
2,2535818742,HAPPKENDRAHAPPY,Charlotte-Mecklenburg Fraternal Order of Polic...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2286,...,Right,1,RightTroll,0,2535818742,779366828585910272,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
3,2535818742,HAPPKENDRAHAPPY,Theodore Roosevelt's son Quentin and his frien...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2290,...,Right,1,RightTroll,0,2535818742,779367073998835712,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/HistoryInPics/status/73882...,,
4,2535818742,HAPPKENDRAHAPPY,.@flashfire451: Are there more cures than term...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2289,...,Right,1,RightTroll,0,2535818742,779367063479521281,http://twitter.com/happkendrahappy/statuses/77...,,,


NumPy's `np.nan` value is a special case of a floating point number representing an unrepresentable value. These kinds of values are called NaNs (Not a Number).  

In [81]:
type(np.nan)

float

`np.nan` cannot be used in equality tests, since any `==` comparison involving `np.nan` evaluates to `False`. This includes comparing `np.nan` to itself. 

In [83]:
n = np.nan
n == n 

False

In addition, `np.nan` is not `None` and does not evaluate to `False`, which makes missing values difficult to detect with ordinary Python tests. Luckily, we can use NumPy's `np.isnan()` function (or the more general `pd.isna()`) for this purpose. This is especially useful in control flow.

In [88]:
if np.nan is None:
    print('NaN is None')
if np.nan:
    print('NaN evaluates to True in control flow')
if np.isnan(np.nan):
    print('NaN is considered a NaN value in NumPy')

NaN evaluates to True in control flow
NaN is considered a NaN value in NumPy
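
`np.isnan()` only works on numeric input and raises a `TypeError` for values such as strings or `None`. For mixed data, `pd.isna()` is a more forgiving choice. The sketch below uses a made-up list of values to show how `pd.isna()` treats each one, and the matching `Series` methods.

```python
import numpy as np
import pandas as pd

# A mix of values: NaN, None, a string, and an ordinary number (made up).
values = [np.nan, None, 'text', 0.0]

# pd.isna() treats both NaN and None as missing, and accepts non-numeric values.
flags = [bool(pd.isna(v)) for v in values]
print(flags)

# The same test is available as Series methods.
s = pd.Series([1.0, np.nan, 3.0])
print(s.isna().tolist())
print(s.notna().tolist())
```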


Additionally, `np.nan` values are generally excluded from pandas functions that perform calculations over dataframes, rows, or columns. For example, documentation often stipulates that a calculation is done over all values, excluding NaN or NULL values. 

In [106]:
total = len(df['tco1_step1'])
count = df['tco1_step1'].count()
print('Total: {}'.format(total))
print('Count: {}'.format(count))
print(' Diff:  {}'.format(total-count))

Total: 2946207
Count: 2100236
 Diff:  845971


The total number of items in the `tco1_step1` column is nearly 850,000 more than the number returned by the `count()` method. If what we learned above is correct, this difference should equal the number of NaNs in the column. 

In [107]:
nans = df['tco1_step1'].isna().sum()
print(' NaNs: {}'.format(nans))

 NaNs: 845971
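
The same exclusion applies to other calculations. Many pandas reductions take a `skipna` argument (default `True`) that controls whether missing values are ignored. This is a minimal sketch on a made-up series.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one missing entry.
s = pd.Series([1.0, np.nan, 3.0])

print(len(s))                # counts every item, including the NaN
print(s.count())             # counts only non-missing items
print(s.sum())               # NaN is skipped by default
print(s.mean())              # the mean is taken over the two non-missing values
print(s.sum(skipna=False))   # with skipna=False the NaN propagates to the result
```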


The `.isna()` method can be useful in transforming and filtering data. Use the `.isna()` method, along with the methods we covered in the filtering section, to show only the rows in `df` where the region column is not missing. Save the resulting `DataFrame` as `region_df`.  

In [109]:
# Your Answer Here

In [None]:
# Run this cell to check
if len(region_df) >= len(df):
    print('Nothing was filtered out, try again!')
else:
    print('At least one value was filtered out of the DataFrame. It is up to you to make sure the right values were removed!')

# Melting
# Pivots
The `pivot_table()` function can be used to create pivot tables similar to those commonly used in spreadsheets. 
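
As a minimal sketch of what this looks like, the example below builds a pivot table from a made-up `toy` dataframe, with one row per account category, one column per language, and the median number of followers in each cell. The data and the choice of `aggfunc='median'` are assumptions for illustration.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    'account_category': ['A', 'A', 'A', 'B'],
    'language': ['English', 'English', 'Russian', 'English'],
    'followers': [10, 30, 4, 6],
})

# Rows come from `index`, columns from `columns`, and each cell aggregates `values`.
table = toy.pivot_table(
    index='account_category',
    columns='language',
    values='followers',
    aggfunc='median',
)
print(table)
```

Combinations that do not occur in the data (here, category `B` with Russian) are filled with `NaN`.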

## Selecting, Aggregating, Summarizing, and Subsetting <a id='sass'></a>

In `Pandas`, each individual column / variable is called a `Series`. To select a `Series` from our `dataframe`, we simply have to type the `Series` name inside square brackets following the name of the `dataframe` object itself. For example, to get the `Series` containing account types, we would type:

In [None]:
df['account_type']

We can perform simple operations on these individual `Series`, such as counting the number of observations per account type in the dataset. 

In [None]:
obs_by_type = df['account_type'].value_counts()
obs_by_type.sort_values(ascending = False)

We can create smaller `dataframes` by passing in a list of the `Series` we want to include, or by subsetting the observations based on some criteria. In the first case, we produce a smaller `dataframe` by selecting specific variables. In the second case we produce a smaller `dataframe` by selecting specific observations. For example, if we wanted to select the variables for following, followers, and account type, we could do the following: 

In [None]:
small = df[['following', 'followers', 'account_type']]
small.sort_values(['following'], ascending = False).head(40)

The second way we might want to subset our data is by selecting some subset of observations. For example, if we wanted to pull out a subset of our data for right troll accounts, we could do the following:

In [None]:
right_trolls = df[df['account_type'] == 'Right']
right_trolls.sample(10)

`Pandas` makes it easy to perform fundamental data operations, such as grouping by one variable and computing a summary statistic for another. Let's say, for example, that we wanted to group our dataset by account type and then get the median number of followers. (Note that I am not suggesting you should do this in a real data analysis, I am simply demonstrating `Pandas` functionality.) 

In [None]:
median_followers = df.groupby('account_type')['followers'].median().round(0)
median_followers.sort_values(ascending = False)[:35]

# Open Work Time <a id='open'></a>