# FUNDAMENTALS OF DATA ANALYSIS WITH PYTHON <br><font color="crimson">DAY 3: SOCIAL SCIENTIFIC COMPUTING WITH PANDAS</font>

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6, 2020

### Course Developers and Instructors 

* Dr. [John McLevey](www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a?challengeId=AQGaFXECVnyVqAAAAW_TLnwJ9VHAlBfinArnfKV6DqlEBpTIolp6O2Bau4MmjzZNgXlHqEIpS5piD4nNjEy0wsqNo-aZGkj57A&submissionId=16582ced-1f90-ec15-cddf-eb876f4fe004), Simon Fraser University (jillianderson8@gmail.com) 

<hr>

### Overview 

This notebook introduces some fundamentals of scientific computing with `Pandas` and `matplotlib`. `Pandas` is an extremely popular Python package for storing, manipulating, and analyzing data in a tabular form, with rows and columns. We will learn how to get data into `pandas`, and then how to perform common data analysis tasks such as selecting columns, filtering rows, and computing descriptive statistics. Then we will learn how to use `matplotlib` for producing high-quality plots for print or the web. We will use it to create a variety of common statistical plots and other visualizations. 

### Plan for the Day

1. [`Pandas` 101](#pandas)
2. [Best practices for `Pandas`](#pandasbp)
3. [Open Work Time](#open)

<hr>

In [3]:
import os
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'svg' # better resolution with vector graphics! 

# `Pandas` 101<a id='pandas'></a>

Quantitative or computational social scientists are used to working with data in tabular form, such as a `dataframe` with variables in the columns and observations in the rows. In Python, the `Pandas` package enables us to organize, manipulate, and analyze data in this familiar way. 

`Pandas` is an extremely popular package in the scientific computing community, regardless of discipline (physics, sociology, neuroscience, history) or sector (academia, government, industry). It was originally developed for time series analysis. It gets its name from **pan**el **da**ta. 

This part of the notebook covers some essential functionality of `Pandas` that you will make heavy use of in most data analyses. Of course, we will not cover *everything* that is possible to do with `Pandas`. As with the previous content, the goal is to build a basic foundation that we can build on throughout the week. We will emphasize the functionality that can take you the furthest in any given data analysis project. 

## Reading Data from Files 

`Pandas` makes it easy to load data from an external file directly into a `DataFrame`, which we will discuss momentarily. It does so using one of many `reader` functions that are part of a suite of `I/O` (input / output, read / write) tools. For some common examples, see the table below. The `pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) describes these and other `reader` functions, along with their parameters, such as how to specify which sheet you want from an Excel spreadsheet, or whether to write the index to a new `csv` file. 



| Data Description                | Reader          | Writer        |
|:--------------------------------|:----------------|:--------------|
| CSV                             | `read_csv()`   | `to_csv()`   |
| JSON                            | `read_json()`  | `to_json()`  |
| MS Excel and OpenDocument (ODF) | `read_excel()` | `to_excel()` |
| Stata                           | `read_stata()` | `to_stata()` |
| SAS                             | `read_sas()`   | NA            |
| SPSS                            | `read_spss()`  | NA            |


To illustrate how these `reader` functions work, we will use the `read_csv()` function. The only *required* argument is the path to the file on our computer. 

In this case, we will use the ["Three Million Russian Trolls" dataset](https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/), which consists of data on ~3M tweets from Twitter accounts that are known to be part of state-sponsored disinformation campaigns. This particular dataset was collected and coded by Darren Linvill and Patrick Warren, of Clemson University. It includes several variables that were hand coded by Linvill and Warren, the most important of which are classifications of accounts into different types. 

The dataset is stored in 13 different `csv` files. They are stored in a directory called `russian-troll-tweets`, which is inside the `data` directory.

In [4]:
!ls data/russian-troll-tweets

IRAhandle_tweets_10.csv  IRAhandle_tweets_2.csv  IRAhandle_tweets_7.csv
IRAhandle_tweets_11.csv  IRAhandle_tweets_3.csv  IRAhandle_tweets_8.csv
IRAhandle_tweets_12.csv  IRAhandle_tweets_4.csv  IRAhandle_tweets_9.csv
IRAhandle_tweets_13.csv  IRAhandle_tweets_5.csv  README.md
IRAhandle_tweets_1.csv	 IRAhandle_tweets_6.csv


Let's start by loading just one of the files. Later we will see how to read in all 13 and combine them into 1 large dataset. 

In [5]:
df = pd.read_csv('data/russian-troll-tweets/IRAhandle_tweets_1.csv')

By default, `pandas` assumes your data is encoded with `UTF-8`. If you see an encoding error, you can switch to a different encoding by passing the `encoding` argument to `read_csv()`, for example `encoding='latin-1'`.
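
To make the encoding point concrete, here is a small sketch with made-up rows that are deliberately encoded as Latin-1. It assumes nothing about the troll dataset itself, which reads fine as `UTF-8`. An in-memory `io.BytesIO` object stands in for a file.

```python
import io

import pandas as pd

# Bytes encoded as Latin-1 rather than UTF-8 (made-up example rows)
raw = 'name,city\nZoë,Köln\n'.encode('latin-1')

# Reading with the default encoding ('utf-8') fails on the accented characters
try:
    pd.read_csv(io.BytesIO(raw))
except UnicodeDecodeError as err:
    print('UTF-8 failed:', err)

# Telling pandas which encoding the bytes actually use fixes the problem
decoded = pd.read_csv(io.BytesIO(raw), encoding='latin-1')
print(decoded)
```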

Once we have our `dataframe`, we can use the `info()` method to see the name of each column, as well as its integer index, number of non-missing values, and datatype. 

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243891 entries, 0 to 243890
Data columns (total 21 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   external_author_id  243891 non-null  int64 
 1   author              243891 non-null  object
 2   content             243891 non-null  object
 3   region              243853 non-null  object
 4   language            243891 non-null  object
 5   publish_date        243891 non-null  object
 6   harvested_date      243891 non-null  object
 7   following           243891 non-null  int64 
 8   followers           243891 non-null  int64 
 9   updates             243891 non-null  int64 
 10  post_type           154592 non-null  object
 11  account_type        243891 non-null  object
 12  retweet             243891 non-null  int64 
 13  account_category    243891 non-null  object
 14  new_june_2018       243891 non-null  int64 
 15  alt_external_id     243891 non-null  int64 
 16  tw

We now have a `dataframe` with 21 variables. The `dataframe` is organized as we would expect: with variables in the columns and observations in the rows. We can use the `.head()` method to preview the top $n$ rows of the dataset. 

In [7]:
df.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,906000000000000000,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,...,Right,0,RightTroll,0,905874659358453760,914580356430536707,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/914580356430...,,
1,906000000000000000,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,...,Right,0,RightTroll,0,905874659358453760,914621840496189440,http://twitter.com/905874659358453760/statuses...,https://twitter.com/damienwoody/status/9145685...,,
2,906000000000000000,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,10/1/2017 22:50,10/1/2017 22:51,1054,9637,255,...,Right,1,RightTroll,0,905874659358453760,914623490375979008,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/913231923715...,,
3,906000000000000000,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,English,10/1/2017 23:52,10/1/2017 23:52,1062,9642,256,...,Right,0,RightTroll,0,905874659358453760,914639143690555392,http://twitter.com/905874659358453760/statuses...,https://twitter.com/10_gop/status/914639143690...,,
4,906000000000000000,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,English,10/1/2017 2:13,10/1/2017 2:13,1050,9645,246,...,Right,1,RightTroll,0,905874659358453760,914312219952861184,http://twitter.com/905874659358453760/statuses...,https://twitter.com/realDonaldTrump/status/914...,,
5,906000000000000000,10_GOP,"Dan Bongino: ""Nobody trolls liberals better th...",Unknown,English,10/1/2017 2:47,10/1/2017 2:47,1050,9644,247,...,Right,0,RightTroll,0,905874659358453760,914320835325853696,http://twitter.com/905874659358453760/statuses...,https://twitter.com/FoxNews/status/91423949678...,,
6,906000000000000000,10_GOP,🐝🐝🐝 https://t.co/MorL3AQW0z,Unknown,English,10/1/2017 2:48,10/1/2017 2:48,1050,9644,248,...,Right,1,RightTroll,0,905874659358453760,914321156466933760,http://twitter.com/905874659358453760/statuses...,https://twitter.com/Cernovich/status/914314644...,,
7,906000000000000000,10_GOP,'@SenatorMenendez @CarmenYulinCruz Doesn't mat...,Unknown,English,10/1/2017 2:52,10/1/2017 2:53,1050,9644,249,...,Right,0,RightTroll,0,905874659358453760,914322215537119234,http://twitter.com/905874659358453760/statuses...,,,
8,906000000000000000,10_GOP,"As much as I hate promoting CNN article, here ...",Unknown,English,10/1/2017 3:47,10/1/2017 3:47,1050,9646,250,...,Right,0,RightTroll,0,905874659358453760,914335818503933957,http://twitter.com/905874659358453760/statuses...,http://www.cnn.com/2017/09/27/us/puerto-rico-a...,,
9,906000000000000000,10_GOP,After the 'genocide' remark from San Juan Mayo...,Unknown,English,10/1/2017 3:51,10/1/2017 3:51,1050,9646,251,...,Right,0,RightTroll,0,905874659358453760,914336862730375170,http://twitter.com/905874659358453760/statuses...,,,


Alternatively, we could use the `.sample()` method to pull a random sample of $n$ observations, which can be helpful if we don't want the observations we preview to be from the top (`head`) or bottom (`tail`) of the dataset.

In [8]:
df.sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
57753,3083086600,ALDRICH420,Obama and the left want to take your gun right...,United States,English,3/27/2016 15:34,3/27/2016 15:34,960,1049,1015,...,Right,0,RightTroll,0,3083086600,714113410351439872,http://twitter.com/Aldrich420/statuses/7141134...,https://twitter.com/NYC4TRUMP2016/status/69485...,,
2223,839000000000000000,1LORENAFAVA1,"Dal neolitico al 2017, il design sloveno parla...",Italy,Italian,4/16/2017 8:15,4/16/2017 8:15,407,89,2595,...,Italian,1,NonEnglish,0,838742761515991041,853522301622534146,http://twitter.com/838742761515991041/statuses...,http://www.glistatigenerali.com/milano_moda-de...,,
167816,895000000000000000,ANIIANTRS,Hillary’s SuperPac BUSTED Violating Federal El...,Unknown,English,9/30/2017 17:08,9/30/2017 17:08,4897,1795,4624,...,Right,0,RightTroll,0,894845726840283136,914175041855721472,http://twitter.com/894845726840283136/statuses...,http://ift.tt/2xGmbE8,,
141249,2573356007,ANDEYNESTEROV,Зачем Украина зазывает на допрос экс-советника...,United States,Russian,3/19/2017 16:35,3/19/2017 16:36,148,102,346,...,Russian,1,NonEnglish,0,2573356007,843501205842247680,http://twitter.com/2573356007/statuses/8435012...,https://twitter.com/GazetaRu/status/8434994348...,https://www.gazeta.ru/politics/2017/03/18_a_10...,
113928,1679279490,AMELIEBALDWIN,You do know that to the 5.5 Americans with dem...,United States,English,3/17/2017 3:05,3/17/2017 3:05,2309,2743,32838,...,Right,1,RightTroll,0,1679279490,842572622223331328,http://twitter.com/1679279490/statuses/8425726...,https://twitter.com/Lawrence/status/8192630385...,,
157666,895000000000000000,ANGEELISHET,#ste Trump Cracks Down on Food Stamp Fraud and...,Unknown,English,8/8/2017 7:36,8/8/2017 7:36,50,0,6,...,Right,1,RightTroll,0,894823022317752320,894824642485129216,http://twitter.com/894823022317752320/statuses...,https://twitter.com/sterllarTR/status/89474078...,http://ift.tt/2wCtdWp,
219154,1671936266,ARM_2_ALAN,#tech Clowns required for public hospitals in ...,United States,English,6/8/2015 19:53,6/8/2015 19:54,68,152,5876,...,Right,1,RightTroll,0,1671936266,607999012986363904,http://twitter.com/Arm_2_Alan/statuses/6079990...,,,
72674,2256083366,ALEXXBELYAEV,В сети появилось видео со свадьбы Пескова и На...,United States,Russian,8/1/2015 19:11,8/1/2015 19:11,1679,630,9139,...,Russian,1,NonEnglish,1,2256083366,627557260248645632,http://twitter.com/AlexxBelyaev/statuses/62755...,https://twitter.com/GazetaRu/status/6275483815...,http://www.gazeta.ru/social/news/2015/08/01/n_...,
178442,2500690416,ANNIEPOSHES,#ToAvoidWorkI throw one of my world class hiss...,United States,English,9/5/2016 14:42,9/5/2016 14:42,1657,1777,2719,...,Hashtager,0,HashtagGamer,1,2500690416,772807193242505216,http://twitter.com/AnniePoshes/statuses/772807...,,,
25366,1652138929,ACEJINEV,Now Playing: RickStarr (@SlickRickstarr) - Rea...,United States,English,8/8/2017 5:30,8/8/2017 5:31,775,906,7222,...,Left,1,LeftTroll,0,1652138929,894792994448191488,http://twitter.com/1652138929/statuses/8947929...,http://1063.mobi,,
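
Because `.sample()` draws rows at random, the preview changes every time the cell is run. If you want the same "random" rows each time (for teaching or debugging, say), one option is the `random_state` parameter. The sketch below uses a made-up `DataFrame`, and the seed value `42` is arbitrary.

```python
import pandas as pd

# Toy dataframe with 100 rows (a stand-in for the real tweets)
toy = pd.DataFrame({'tweet': range(100)})

# The same seed gives the same sample on every run
first_draw = toy.sample(5, random_state=42)
second_draw = toy.sample(5, random_state=42)
print(first_draw.equals(second_draw))

# .tail() is the counterpart of .head(): it previews the last n rows
print(toy.tail(3))
```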


To load up the full dataset -- which is spread across 13 files -- we can read in each `csv` file and concatenate them all into a single `dataframe`. Note that if your data is contained in a single file, this step would not be necessary. 

In [9]:
data_dir = os.listdir('data/russian-troll-tweets')
data_dir

['README.md',
 'IRAhandle_tweets_6.csv',
 'IRAhandle_tweets_10.csv',
 'IRAhandle_tweets_13.csv',
 'IRAhandle_tweets_11.csv',
 'IRAhandle_tweets_3.csv',
 'IRAhandle_tweets_12.csv',
 'IRAhandle_tweets_5.csv',
 '.ipynb_checkpoints',
 'IRAhandle_tweets_7.csv',
 'IRAhandle_tweets_9.csv',
 'IRAhandle_tweets_1.csv',
 'IRAhandle_tweets_8.csv',
 'IRAhandle_tweets_4.csv',
 'IRAhandle_tweets_2.csv']

In [10]:
files = [f for f in data_dir if f.endswith('.csv')]  # keep only the csv files
files 

['IRAhandle_tweets_6.csv',
 'IRAhandle_tweets_10.csv',
 'IRAhandle_tweets_13.csv',
 'IRAhandle_tweets_11.csv',
 'IRAhandle_tweets_3.csv',
 'IRAhandle_tweets_12.csv',
 'IRAhandle_tweets_5.csv',
 'IRAhandle_tweets_7.csv',
 'IRAhandle_tweets_9.csv',
 'IRAhandle_tweets_1.csv',
 'IRAhandle_tweets_8.csv',
 'IRAhandle_tweets_4.csv',
 'IRAhandle_tweets_2.csv']
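
An alternative way to collect only the `csv` files is pattern matching with `pathlib`. This is a sketch rather than the approach used in this notebook: it builds a throwaway directory with placeholder files so that it runs anywhere, and the file names simply imitate the ones listed above.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Create a few placeholder files and a checkpoint folder to filter out
    for name in ['IRAhandle_tweets_1.csv', 'IRAhandle_tweets_2.csv', 'README.md']:
        (folder / name).write_text('placeholder\n')
    (folder / '.ipynb_checkpoints').mkdir()

    # glob('*.csv') matches only names ending in .csv; sorting makes the order predictable
    csv_names = sorted(p.name for p in folder.glob('*.csv'))

print(csv_names)
```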

We will overwrite the `df` created earlier. 

In [11]:
# Read every csv file in the list and stack them into one dataframe
paths = (os.path.join('data/russian-troll-tweets', f) for f in files)
df = pd.concat(pd.read_csv(p, encoding='utf-8', low_memory=False) for p in paths)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2946207 entries, 0 to 250519
Data columns (total 21 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   external_author_id  object
 1   author              object
 2   content             object
 3   region              object
 4   language            object
 5   publish_date        object
 6   harvested_date      object
 7   following           int64 
 8   followers           int64 
 9   updates             int64 
 10  post_type           object
 11  account_type        object
 12  retweet             int64 
 13  account_category    object
 14  new_june_2018       int64 
 15  alt_external_id     object
 16  tweet_id            int64 
 17  article_url         object
 18  tco1_step1          object
 19  tco2_step1          object
 20  tco3_step1          object
dtypes: int64(6), object(15)
memory usage: 494.5+ MB
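
Notice that the index above runs from 0 to 250519 even though there are 2,946,207 rows. By default, `pd.concat()` keeps the index labels each piece arrived with, so labels repeat across files. The toy sketch below (made-up values) shows the default behaviour next to the `ignore_index=True` option, which renumbers the rows instead.

```python
import pandas as pd

# Two small pieces, each with its own default index starting at 0
part_a = pd.DataFrame({'followers': [10, 20]})
part_b = pd.DataFrame({'followers': [30, 40, 50]})

# Default: original index labels are kept, so they repeat
stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())      # [0, 1, 0, 1, 2]

# ignore_index=True builds a fresh 0..n-1 index for the result
renumbered = pd.concat([part_a, part_b], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

This notebook keeps the default, so the outputs shown here have repeating index labels.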


In this case, we have two datatypes in our `dataframe`: `object` and `int64`. `Pandas` uses `object` to refer to columns that contain `strings`, or which contain mixed types, such as `strings` and `integers`. In this case, they refer to `strings`. `int64` are integers. In addition to these two data types, `pandas` stores `floats` (`float64`), booleans (True or False), several specialized `datetime` data structures, and categorical variables.  
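
To see the other datatypes in action, here is a small sketch on invented data. The column names echo the troll dataset, but the values and the `ratio` column are made up, and the date format string is an assumption based on how `publish_date` appears in the previews above.

```python
import pandas as pd

toy = pd.DataFrame({
    'author': ['a', 'b', 'c'],          # text
    'followers': [1, 2, 3],             # integers
    'ratio': [0.5, 1.5, 2.0],           # floats
    'retweet': [True, False, True],     # booleans
})

# Dates usually arrive as text and need to be converted explicitly
toy['publish_date'] = pd.to_datetime(
    ['10/1/2017 19:58', '3/27/2016 15:34', '8/8/2017 5:30'],
    format='%m/%d/%Y %H:%M',
)

# Text with a small number of repeated values can be stored as a categorical
toy['account_type'] = pd.Series(['Right', 'Left', 'Right']).astype('category')

print(toy.dtypes)
```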

One further thing to note about this dataset: **each row is a tweet from a specific account, but some of the variables describe attributes of the tweeting accounts, not of the tweet itself**. For example, `followers` describes the number of followers that the account had at the time it sent the tweet. This makes sense, because tweets don't have followers, but accounts do. We need to keep this in mind when working with this dataset. 
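
A toy sketch of why this matters: averaging an account-level variable over tweets gives more weight to accounts that tweet a lot. The data below is invented, and taking each account's maximum follower count is just one possible way to get one row per account.

```python
import pandas as pd

# Three tweets from two accounts (made-up values)
tweets = pd.DataFrame({
    'author': ['A', 'A', 'B'],
    'followers': [100, 110, 5],
})

# Tweet-level mean: account A is counted twice
tweet_level_mean = tweets['followers'].mean()

# Account-level mean: first reduce to one value per account, then average
per_account = tweets.groupby('author')['followers'].max()
account_level_mean = per_account.mean()

print(tweet_level_mean, account_level_mean)
```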

## Understanding `Pandas` Data Structures 

Now that we have a `dataframe` loaded into memory, we can move on to some interesting data analyses. But first, let's devote a bit of time to clarifying `pandas` data structures. 

### Background Knowledge &mdash; Dynamic Typing [<i class="fa fa-forward"></i>](#skip_dynamic)

> Note: feel free to use the [<i class="fa fa-forward"></i>](#skip_dynamic) button above to temporarily [skip](#skip_dynamic) over this "background knowledge" section if you are feeling overwhelmed with new information. It is useful to know, but it is not *essential* knowledge for using Pandas to analyze data. You will not lose much if you come back to this at some point in the future, when you are more comfortable with basic `pandas` data structures and operations. 

First, some background knowledge. Python is a dynamically typed language. What that means is that you don't need to constantly tell Python what kind of object something is. For example, if you add two numbers together

In [12]:
42 + 8

50

it is not necessary to tell Python that `42` and `8` are integers. Instead, Python stores that metadata in each object. 

When we store data in a list, every element in the list is a Python object in its own right, containing not only the data itself (e.g. `42`), but also information about the **type** of data that it is, which in this case is `int`. This is enormously useful in many cases, because we can store objects of different types in a single `list`.

In [13]:
some_data = [42, 8.0, 'a string']
print(some_data)

[42, 8.0, 'a string']


In this example, each element in `some_data` also contains information about the type of object it is. As previously mentioned, this is enormously helpful in some contexts, but dramatically slows down computation in other contexts. Data analysis is one example of where, depending on what you are trying to do, dynamic typing can slow things down rather a lot. 
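
We can ask each element for its type directly, which makes the per-element bookkeeping visible:

```python
some_data = [42, 8.0, 'a string']

# Every element carries its own type information
type_names = [type(element).__name__ for element in some_data]
print(type_names)  # ['int', 'float', 'str']
```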

When you are analyzing data, you are almost always working with some collection of elements that are all of the same type, such as integers, floats, strings, or Boolean values. For example, you can't compute the mean and standard deviation of a collection of elements that include both integers and strings. So it follows that data analysis can be made more efficient by working on data structures where information about data types is stored at the level of the collection itself rather than each element in the collection, *provided the data is all of the same type*. 

One of the main tools for doing this in `Python` is the `numpy` package, which is more or less the foundation of all data analysis in `Python`, whether you explicitly use it or not. `numpy` provides data structures for working with `arrays` of data that are a bit like lists except that all elements are of the same type, information about that type is stored at the level of the `array` itself, and each element in the `array` has an implicit integer index (its position). `arrays` can be one dimensional vectors or multi-dimensional matrices. 
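
A brief sketch of what "type stored at the level of the array" looks like in practice. The numbers are arbitrary:

```python
import numpy as np

# All elements are integers, so the array has a single integer dtype
ints = np.array([42, 8])
print(ints.dtype)

# Mixing an integer and a float: numpy converts everything to one common type
mixed = np.array([42, 8.0])
print(mixed.dtype)

# Because the type is known up front, numeric summaries are straightforward
print(ints.mean(), ints.std())
```

Note that `std()` in `numpy` computes the population standard deviation by default, whereas `pandas` defaults to the sample standard deviation.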

Further discussion of `numpy` is beyond the scope of this class. For our purposes here, what you need to know is that `pandas` builds on top of `numpy` and offers an additional set of data structures and methods that are designed explicitly to meet the needs of researchers working with real-world empirical data. Like `numpy`, `pandas` is designed to make scientific computing more efficient, but as we will learn below there are some common pitfalls that, if you are not careful, can actually make working with `pandas` slow and inefficient. 


<a id='skip_dynamic'></a>
## Back to Essential `Pandas`: `Series` and `index`

Each column in a `dataframe` is an object called a `series`. A `series` is a one-dimensional object, such as a vector of numbers. However, that vector is associated with an `index`, which is a vector, or array, of labels. 

For example, the column `followers` in our Russian troll `dataframe` is a `series` of integers (the number of followers each account had when it sent the tweet) and their `index` labels. 

In [14]:
num_followers = df['followers']
type(num_followers)

pandas.core.series.Series

Below, we pull a sample of 25 observations from the `series`. The value on the left is the index label for the observation, and the number on the right is the actual data value (number of followers). The index labels are sequential within each of the original files (they restart at 0 for each file, because we concatenated the files without resetting the index), but they are out of sequence here because we pulled a random sample. 

In [15]:
num_followers.sample(25)

7374          1
191609    19654
178822      831
65495      2224
228612      884
232589    20131
136330      257
182677     2599
165012        4
51572       126
44338       224
46082       352
155737       67
2501         55
38537     13319
237324      703
248651     2374
154253    12280
11619       537
197990    17453
238573    13253
59493      1304
73443         4
16111     40343
236439     4201
Name: followers, dtype: int64

In most cases, the default `index` for a `series` or `dataframe` is an immutable vector of integers:

In [16]:
num_followers.index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            250510, 250511, 250512, 250513, 250514, 250515, 250516, 250517,
            250518, 250519],
           dtype='int64', length=2946207)

In some cases, such as time series analysis, the `index` might default to a `DatetimeIndex` or a `PeriodIndex`, but we will not consider those in this course. If you are working with time series data, the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) provides explanations of how to use these types of `indices`.

We can easily modify an `index` so that it is made up of some other type of value instead, such as `strings`. Surprisingly, `index` values do not need to be unique (technically, they are a `multiset`, or a `set` that is allowed to have repeated elements). This enables us to do some powerful things, but most of the time you should avoid manually changing indexes. 

We can use the `index` to retrieve specific values from a `series` much as we would if we were selecting an element from a `list`, `tuple`, or `array`.
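
A minimal sketch of index-based retrieval, using a small hand-built `Series` with account names as string labels (the follower counts are borrowed from the previews earlier in the notebook). `.loc` looks values up by label and `.iloc` by integer position.

```python
import pandas as pd

followers_by_account = pd.Series(
    [9636, 1049, 89],
    index=['10_GOP', 'ALDRICH420', '1LORENAFAVA1'],
    name='followers',
)

# Look up by index label
by_label = followers_by_account.loc['ALDRICH420']

# Look up by integer position, like a list or array
by_position = followers_by_account.iloc[0]
print(by_label, by_position)

# Index labels may repeat; a label lookup then returns every match
repeated = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
print(repeated.index.is_unique)
print(repeated.loc['a'])
```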

### Operations on `Series`: Descriptive Statistics

As we will soon see, there are a number of operations we can perform on `Series`, such as simple descriptive statistics like mean, median, mode, and standard deviation.

In [17]:
print('Median ', num_followers.median())
print('Mean ', num_followers.mean())
print('Standard Deviation ', num_followers.std())

Median  1274.0
Mean  7055.265491868019
Standard Deviation  14635.939344602943


Because a `Series` is built on top of a `numpy` `array`, `numpy` functions work on `Series` objects and on the values they return. For example, we can use the `round()` function from `numpy` to round these descriptive statistics to three decimal places. 

In [18]:
print('Median ', np.round(num_followers.median(), 3))
print('Mean ', np.round(num_followers.mean(), 3))
print('Standard Deviation ', np.round(num_followers.std(), 3))

Median  1274.0
Mean  7055.265
Standard Deviation  14635.939
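
`numpy` functions can also be applied to an entire `Series` at once, in which case the result is a new `Series` with the same index. The sketch below uses a handful of made-up follower counts, and also shows `.mode()`, which was mentioned above but not demonstrated.

```python
import numpy as np
import pandas as pd

followers_toy = pd.Series([1, 4, 4, 126, 2224], name='followers')

# Element-wise square root: returns a Series, not a bare array
roots = np.sqrt(followers_toy)

# Rounding the whole Series to two decimal places
rounded = np.round(roots, 2)
print(rounded)

# mode() returns a Series because there can be ties for most common value
most_common = followers_toy.mode()
print(most_common.tolist())
```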


We can also `count` the number of non-missing observations in a `series`

In [19]:
num_followers.count()

2946207

or get an overview of multiple descriptives at once:

In [20]:
num_followers.describe()

count    2.946207e+06
mean     7.055265e+03
std      1.463594e+04
min     -1.000000e+00
25%      3.220000e+02
50%      1.274000e+03
75%      1.085300e+04
max      2.512760e+05
Name: followers, dtype: float64

If our series is categorical, we can also easily compute useful information such as the number of unique categories, the size of each category, and so on. For example, let's look at the `account_type` `series`.

In [21]:
atype = df['account_type']

In [22]:
atype.unique()

array(['Right', 'Russian', '?', 'Koch', 'Hashtager', 'Commercial', 'Left',
       'local', 'Arabic', 'news', 'German', 'Spanish', 'French',
       'Italian', 'Ebola ', 'Portuguese', 'Uzbek', 'Ukranian',
       'ZAPOROSHIA'], dtype=object)

In [23]:
atype.value_counts()

Right         711668
Russian       704917
local         459220
Left          427141
Hashtager     241786
news          139006
Commercial    121904
German         91511
Italian        15680
?              13539
Koch           10894
Arabic          6228
Spanish         1226
French          1117
ZAPOROSHIA       175
Portuguese       118
Ebola             71
Ukranian           4
Uzbek              2
Name: account_type, dtype: int64
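
Two related operations are often handy with categorical `series`: counting the number of distinct categories with `.nunique()`, and expressing category sizes as proportions with `value_counts(normalize=True)`. The toy sketch below also strips stray whitespace from labels, since `'Ebola '` in the output above has a trailing space. The data is made up.

```python
import pandas as pd

atype_toy = pd.Series(['Right', 'Left', 'Right', 'Ebola ', 'Right', 'Left'])

# Number of distinct categories
n_categories = atype_toy.nunique()

# Share of observations in each category instead of raw counts
shares = atype_toy.value_counts(normalize=True)
print(n_categories)
print(shares)

# Remove leading and trailing whitespace from every label
cleaned = atype_toy.str.strip()
print(cleaned.unique())
```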

Later, we will consider some summary statistics for pairs of `series`, such as computing correlations and covariance. 

## DataFrames

We already have our `DataFrame` loaded into memory (as `df`), but so far all we have used it for is pulling out individual `series`. This is easy to do in part because `DataFrames` are themselves just collections of `Series` that are aligned on the same `index` values. In other words, both `Series` we worked with previously -- `atype` and `num_followers` -- have their own `indexes` when we work with them as `Series`, but in a `DataFrame`, they share an index. `DataFrames` are organized the way we would expect: with variables in the columns and observations in the rows. We can use the `.head()` method to preview the top $n$ rows of the dataset. 

In [24]:
df.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,2535818742,HAPPKENDRAHAPPY,Bosh situation hasn't changed since Feb. Heat ...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2285,...,Right,1,RightTroll,0,2535818742,779366817752084481,http://twitter.com/happkendrahappy/statuses/77...,,,
1,2535818742,HAPPKENDRAHAPPY,Youre an IDIOT! Now you say 99% when before yo...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2284,...,Right,1,RightTroll,0,2535818742,779366807446642688,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/erecordscity/status/779354...,,
2,2535818742,HAPPKENDRAHAPPY,Charlotte-Mecklenburg Fraternal Order of Polic...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2286,...,Right,1,RightTroll,0,2535818742,779366828585910272,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
3,2535818742,HAPPKENDRAHAPPY,Theodore Roosevelt's son Quentin and his frien...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2290,...,Right,1,RightTroll,0,2535818742,779367073998835712,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/HistoryInPics/status/73882...,,
4,2535818742,HAPPKENDRAHAPPY,.@flashfire451: Are there more cures than term...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2289,...,Right,1,RightTroll,0,2535818742,779367063479521281,http://twitter.com/happkendrahappy/statuses/77...,,,
5,2535818742,HAPPKENDRAHAPPY,Suspected Illegal Alien Marijuana Farmers Held...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2288,...,Right,1,RightTroll,0,2535818742,779367053077712896,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
6,2535818742,HAPPKENDRAHAPPY,This picture is 100% BOGUS! Just watch MSNBC &...,United States,English,9/23/2016 17:10,9/23/2016 17:10,1311,1688,2291,...,Right,1,RightTroll,0,2535818742,779367286385831937,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/micheleredding2/status/779...,,
7,2535818742,HAPPKENDRAHAPPY,Proud to be part-Polish! Poland Initially Appr...,United States,English,9/23/2016 17:11,9/23/2016 17:11,1311,1688,2293,...,Right,1,RightTroll,0,2535818742,779367574454800384,http://twitter.com/happkendrahappy/statuses/77...,http://www.lifenews.com/2016/09/23/poland-pois...,,
8,2535818742,HAPPKENDRAHAPPY,in your case Hillary if you do when you're go...,United States,English,9/23/2016 17:11,9/23/2016 17:12,1311,1688,2292,...,Right,1,RightTroll,0,2535818742,779367562782117889,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/cboutet11/status/778718668...,,
9,2535818742,HAPPKENDRAHAPPY,#ThingsMoreTrustedThanHillary Any fairy tale book,United States,English,9/27/2016 1:35,9/27/2016 1:37,1311,1686,2293,...,Right,0,RightTroll,0,2535818742,780581631342080001,http://twitter.com/happkendrahappy/statuses/78...,,,


As before, we could instead use the `.sample()` method to pull a random sample of $n$ observations, this time from the full `DataFrame`.

In [25]:
df.sample(5)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
233926,2882331822,JENN_ABRAMS,'@ladygaga be ourselves is all that we can do',United States,English,8/4/2015 11:41,8/4/2015 11:41,10374,29359,5543,...,Right,0,RightTroll,0,2882331822,628531106829086721,http://twitter.com/Jenn_Abrams/statuses/628531...,,,
85765,1645969141,FIGHTM_D_B,"""You can't separate peace from freedom because...",United States,English,5/20/2016 13:54,5/20/2016 13:55,1047,425,1451,...,Left,1,LeftTroll,0,1645969141,733657143153414144,http://twitter.com/FightM_D_B/statuses/7336571...,,,
30864,892000000000000000,CHAASNTR,https://t.co/mfBjIJYL05 Rewind 07/2017 \| All ...,Unknown,English,8/15/2017 17:11,8/15/2017 17:11,2909,1335,6794,...,Right,0,RightTroll,0,891902187130966017,897505985253781505,http://twitter.com/891902187130966017/statuses...,https://hedgeaccordingly.com/2017/07/all-too-c...,http://Covfefe.bz,
29915,2494112058,DAILYSANJOSE,#politics Spike Lee Endorses Bernie Sanders,United States,English,2/23/2016 15:26,2/23/2016 15:26,6387,11816,13865,...,local,0,NewsFeed,1,2494112058,702152636041027585,http://twitter.com/DailySanJose/statuses/70215...,,,
84770,2547141851,CHICAGODAILYNEW,Cubs working to arrange White House visit befo...,United States,English,12/5/2016 23:04,12/5/2016 23:04,6952,19642,41743,...,local,0,NewsFeed,0,2547141851,805910745795923969,http://twitter.com/2547141851/statuses/8059107...,https://twitter.com/ChicagoDailyNew/status/805...,http://www.chicagotribune.com/ct-cubs-white-ho...,


When working with a `dataframe`, we can select subsets of data by selecting columns or filtering rows. Let's look at selecting columns first. 

### Selecting Columns 

Earlier, we saw how to select a single column by specifying the name of the `dataframe` followed by the name of the `series` inside square brackets and straight quotes. 

In [26]:
followers = df['followers']
followers.sample(10)

135299      470
119646    18574
13572      1460
90734       790
8529        523
80977       590
119437    12402
140857      902
194257    14852
7092      14617
Name: followers, dtype: int64

We can select multiple columns by passing a list of column names. Whereas the result of the previous selection was a `Series` (because we only pulled one column), selecting multiple columns will return a `DataFrame` containing only the requested columns. 

In [27]:
ff = df[['followers', 'following']]
ff.sample(10)

Unnamed: 0,followers,following
48251,1801,4968
40762,953,853
210105,11230,5137
178212,26423,18522
129568,1935,1483
93530,253,869
143807,124,247
55191,2587,2511
190334,354,472
256457,1481,1268


This kind of subsetting can be very helpful when, for example, you are working with datasets that have a lot of columns, only some of which are required for your analysis. 
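
The difference between the two return types can be checked on a small example. The sketch below uses a made-up `toy` `DataFrame` (its name and values are assumptions for illustration, not part of the tweets dataset) to show that a single name returns a `Series`, while a list of names returns a `DataFrame`, even when the list has only one element.

```python
import pandas as pd

# Hypothetical toy data standing in for the much larger tweets DataFrame
toy = pd.DataFrame({
    'author': ['A', 'B', 'C'],
    'followers': [10, 250, 3000],
    'following': [5, 300, 1500],
})

single = toy['followers']                    # one name -> Series
multiple = toy[['followers', 'following']]   # list of names -> DataFrame
one_col_df = toy[['followers']]              # list with one name -> still a DataFrame

print(type(single).__name__)
print(type(multiple).__name__)
print(type(one_col_df).__name__)
```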

### Filtering Rows 

It is also sometimes necessary to filter rows. There are a variety of ways to do this, including slices (e.g. all observations between index $i$ and index $j$). In a data analysis context, most of the row filtering you will do is likely to be based on some sort of explicit condition, such as "give me all the observations with at least 1,000 followers." Most likely, you will only filter rows with slices when you are selecting the first $n$ rows of a `DataFrame` that has been sorted by the values of some `Series`. We will consider this case later. 

In [28]:
df[df['followers'] >= 1000].sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
187315,4437233895,CRYSTAL1JOHNSON,"Cops Body Slam, Beat & Make Racist Comments to...",United States,English,11/13/2016 20:50,11/13/2016 20:50,10690,21879,2767,...,Left,0,LeftTroll,0,4437233895,797904461800669184,http://twitter.com/4437233895/statuses/7979044...,https://twitter.com/Crystal1Johnson/status/797...,,
143021,4272870988,PAMELA_MOORE13,Judge Nap: Loretta Lynch may face felony charg...,United States,English,6/28/2017 1:00,6/28/2017 1:01,42039,63048,5208,...,Right,0,RightTroll,0,4272870988,879867111224602625,http://twitter.com/4272870988/statuses/8798671...,https://twitter.com/Pamela_Moore13/status/8798...,,
71615,2559217373,ALEXSVLADIMIROV,Мы не дадим противникам #Brexit связать нам ру...,Unknown,Bulgarian,11/6/2016 2:15,11/6/2016 2:15,71,2211,614,...,Russian,1,NonEnglish,0,2559217373,795087083287281664,http://twitter.com/2559217373/statuses/7950870...,https://twitter.com/zvezdanews/status/79508537...,http://tvzvezda.ru/news/vstrane_i_mire/content...,
158165,2951506251,ROOMOFRUMOR,Twitter users cheer Thursday Night Football li...,United States,English,9/16/2016 2:46,9/16/2016 2:46,8562,12270,29247,...,news,0,NewsFeed,1,2951506251,776613088783052800,http://twitter.com/RoomOfRumor/statuses/776613...,,,
194503,1670839033,JEANETTEDBOLDEN,#IDontNeedACostumeBecause The shotgun and deli...,United States,English,10/26/2016 15:18,10/26/2016 15:19,1506,1772,2043,...,Hashtager,1,HashtagGamer,1,1670839033,791298077214797824,http://twitter.com/JeanetteDBolden/statuses/79...,,,
207115,898000000000000000,CAARMBUTL,BREAKING: Senate Intel Committee BLASTS Media ...,Unknown,English,10/5/2017 23:56,10/5/2017 23:56,4822,2288,1960,...,Right,0,RightTroll,0,898418554311131136,916089788129824770,http://twitter.com/898418554311131136/statuses...,http://ift.tt/2xYkKPt,,
212410,2912754262,PIGEONTODAY,'@NBCNews https://t.co/0TnZNw0uk2',United States,English,12/7/2015 11:11,12/7/2015 11:12,13855,19991,11045,...,Right,0,RightTroll,0,2912754262,673822235463835648,http://twitter.com/PigeonToday/statuses/673822...,https://twitter.com/PigeonToday/status/6738222...,,
175737,2578422308,RUSSIANALLIES,Су-30СМ https://t.co/lzScIOt9nQ,United Arab Emirates,Russian,8/2/2017 18:30,8/2/2017 19:26,3888,25452,21471,...,Russian,0,NonEnglish,0,2578422308,892814762106429440,http://twitter.com/2578422308/statuses/8928147...,https://twitter.com/russianallies/status/89281...,,
191262,789000000000000000,WORLDNEWSPOLI,98 percent of Republican military veterans app...,United States,English,5/29/2017 17:12,5/29/2017 17:12,4319,3083,31548,...,Right,0,RightTroll,0,789266125485998080,869239951803637760,http://twitter.com/789266125485998080/statuses...,https://twitter.com/WorldnewsPoli/status/86923...,http://www.washingtontimes.com/news/2017/may/2...,
198260,2753211010,PHOENIXDAILYNEW,North Scottsdale condos move forward amid conc...,United States,English,2/4/2017 14:18,2/4/2017 14:18,5294,14827,21277,...,local,0,NewsFeed,0,2753211010,827883988182237190,http://twitter.com/2753211010/statuses/8278839...,https://twitter.com/PhoenixDailyNew/status/827...,http://www.azcentral.com/story/news/local/scot...,


Alternatively, we could filter based on membership in some category, such as being a `RightTroll` or `LeftTroll` account. `RightTroll` and `LeftTroll` are values of the `account_category` `series`. Let's get `RightTroll` accounts. 

In [29]:
df[df['account_category'] == 'RightTroll'].sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
137144,890000000000000000,LAWWAANCTR,#top RT TerreBehlog: 🇺🇸 Register To Vote Now 🇺...,Unknown,English,8/14/2017 10:54,8/14/2017 10:54,2965,948,6220,...,Right,0,RightTroll,0,890475284507623424,897048711884410880,http://twitter.com/890475284507623424/statuses...,https://twitter.com/TerreBehlog/status/8970475...,,
219396,1671936266,ARM_2_ALAN,A billionaire gives #Harvard a $400 million en...,United States,English,6/8/2015 2:33,6/8/2015 2:33,68,149,5306,...,Right,1,RightTroll,0,1671936266,607737191800840193,http://twitter.com/Arm_2_Alan/statuses/6077371...,https://twitter.com/Jenn_Abrams/status/6077249...,,
219271,898000000000000000,CAMELIISRT,"Crybaby Kaepernick Files Grievance, Claims NFL...",United States,English,10/16/2017 1:20,10/16/2017 1:20,4665,2678,2352,...,Right,0,RightTroll,0,898394737618501632,919734791985094656,http://twitter.com/898394737618501632/statuses...,http://ift.tt/2gHVEgk,,
251657,719000000000000000,JIHADIST2NDWIFE,.@kafirkaty Western beauty standards are outda...,United States,English,4/23/2016 22:06,4/23/2016 22:06,661,841,176,...,Right,0,RightTroll,0,719281218965999616,723996387633819648,http://twitter.com/Jihadist2ndWife/statuses/72...,https://twitter.com/Jihadist2ndWife/status/723...,,
142146,4272870988,PAMELA_MOORE13,Absolutely true... https://t.co/BHJopDNLX5,United States,English,5/29/2017 22:03,5/30/2017 1:00,42273,56987,4493,...,Right,0,RightTroll,0,4272870988,869313250655182848,http://twitter.com/4272870988/statuses/8693132...,https://twitter.com/Pamela_Moore13/status/8693...,,
76712,892000000000000000,DAPNESSTR,"RT mikandynothem: Last we forget, Democrats ar...",Unknown,English,8/18/2017 9:28,8/18/2017 9:28,1960,1090,2477,...,Right,0,RightTroll,0,891937707370307585,898476681018650624,http://twitter.com/891937707370307585/statuses...,https://twitter.com/i/web/status/8983640140229...,,
75629,893000000000000000,ALEXXDRTRR,RT carrerapulse: The Obama admin facilitated p...,Unknown,English,8/18/2017 4:20,8/18/2017 4:20,34,6,2981,...,Right,0,RightTroll,0,893397699579576321,898399166098845696,http://twitter.com/893397699579576321/statuses...,http://bit.ly/2vwcPGx,,
49481,892000000000000000,CHARRISSTR,Identity politics of the democrat party are de...,Unknown,English,8/15/2017 21:15,8/15/2017 21:16,2918,1476,943,...,Right,0,RightTroll,0,891907553356955648,897567566960279552,http://twitter.com/891907553356955648/statuses...,,,
201026,891000000000000000,ARABMTR,RT TenaciousTrumps: Let's build our Trumper ar...,Unknown,English,8/2/2017 18:19,8/2/2017 18:19,1994,211,250,...,Right,0,RightTroll,0,891230914629521408,892812144370167810,http://twitter.com/891230914629521408/statuses...,https://twitter.com/i/web/status/8924392628886...,,
52098,1671234620,HYDDROX,Pope Again Forgets About Vatican’s GIANT Wall ...,United States,English,2/9/2017 6:25,2/9/2017 6:25,2560,2253,15731,...,Right,1,RightTroll,0,1671234620,829576879740952576,http://twitter.com/1671234620/statuses/8295768...,https://twitter.com/DailyCaller/status/8295044...,http://trib.al/icRAQ8M,


We are left with a subset of 711,668 tweets (check for yourself: `len(df[df['account_category'] == 'RightTroll'])`) sent by accounts that are classified as `RightTroll`. 
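
Conditions like these can also be combined. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration). It shows the boolean mask that a condition produces, how two conditions can be joined with `&`, and how `isin()` tests membership in several categories at once.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the tweets dataset
toy = pd.DataFrame({
    'author': ['A', 'B', 'C', 'D'],
    'followers': [10, 2500, 3000, 40],
    'account_category': ['RightTroll', 'LeftTroll', 'RightTroll', 'NewsFeed'],
})

# A condition evaluates to a boolean Series (a "mask") with one value per row
mask = toy['followers'] >= 1000

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
popular_right = toy[(toy['followers'] >= 1000) & (toy['account_category'] == 'RightTroll')]

# isin() keeps rows whose value appears in the given list
trolls = toy[toy['account_category'].isin(['RightTroll', 'LeftTroll'])]

print(popular_right)
print(trolls)
```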

### Removing Duplicates

**<font color='crimson'>!! TODO !!</font>**
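
As a placeholder, here is a minimal sketch of one common approach, `drop_duplicates()`, on hypothetical toy data. Which columns should define a duplicate in the tweets dataset is an analysis decision; the choice of `author` below is only an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row (rows 0 and 1)
toy = pd.DataFrame({
    'author': ['A', 'A', 'B', 'B'],
    'content': ['hello', 'hello', 'hello', 'bye'],
})

# Drop rows that are identical across every column
no_exact_dupes = toy.drop_duplicates()

# Treat rows as duplicates when they share an author, keeping the first one seen
one_per_author = toy.drop_duplicates(subset=['author'], keep='first')

# duplicated() returns the boolean mask that drop_duplicates() acts on
dupe_mask = toy.duplicated()

print(no_exact_dupes)
print(one_per_author)
```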

### Adding New Columns 
Often, we need to add new columns to our `DataFrame` based on values in other columns.   

In [49]:
# To save our computers we will use a subset
small_df = df.sample(1000)

Sometimes, these new columns are transformations of a single column that already exists in the `DataFrame`.    

For example, we can create a new `empty_tweet` column. This column will be `True` when the `content` column is empty and `False` otherwise. 

In [50]:
small_df['empty_tweet'] = small_df['content'].isna()

We can also implement more complex transformations, such as those defined in custom functions. 

For example, the code below uses a custom function to extract the number of hashtags used in a tweet. 

In [67]:
def num_hashtags(row):
    tweet = row['content']
    try:
        num = tweet.count('#')
        return num
    except AttributeError:
        return 0

small_df['num_hashtags'] = small_df.apply(num_hashtags, axis=1)
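
For a transformation this simple, the string methods available under `.str` may offer a shorter, column-wise alternative to a row-wise `apply()`. The sketch below uses hypothetical toy data and assumes, as the custom function does, that a missing tweet should count as 0 hashtags.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one missing tweet
toy = pd.DataFrame({'content': ['#a and #b', 'no tags here', np.nan]})

# Row-wise version, mirroring the custom function above
def num_hashtags(row):
    tweet = row['content']
    try:
        return tweet.count('#')
    except AttributeError:  # missing values are floats, which have no count()
        return 0

by_apply = toy.apply(num_hashtags, axis=1)

# Column-wise version: str.count() leaves missing values missing, so fill them with 0
by_str = toy['content'].str.count('#').fillna(0).astype(int)

print(by_apply.tolist())
print(by_str.tolist())
```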

In other cases, we will want to use multiple columns to create a new column.    

For example, we may want to calculate the follower-to-following ratio for accounts on Twitter. 

In [None]:
small_df['followers_following_ratio'] = small_df['followers'] / small_df['following']
small_df.sample(10)

Once again, we can use a custom function to transform multiple columns to create one new column.   

In the cell below, use a custom function and the `apply()` method to create a new column called `more_followers` from the `followers` and `following` columns. This column should be `True` if an account has more followers than following, and `False` otherwise.

In [None]:
# Your Answer Here

Check out the results of our transformations in the `DataFrame` below. 

In [70]:
small_df.sample(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1,empty_tweet,num_hashtags
236911,2589513234,ZUBOVNIK,"Правительство РФ предупредило Эрдогана, что Ро...",Unknown,Russian,2/20/2016 13:36,2/20/2016 13:42,3611,21394,17642,...,NonEnglish,0,2589513234,701037739291107328,http://twitter.com/zubovnik/statuses/701037739...,https://twitter.com/zubovnik/status/7010377392...,,,False,0
188937,1708354368,EINSOLL,«Спартак-2» разгромил «Енисей» со счётом 4:0 h...,Unknown,Russian,8/3/2015 18:53,8/3/2015 18:53,1475,279,2316,...,NonEnglish,1,1708354368,628277556777742336,http://twitter.com/EInsoll/statuses/6282775567...,https://twitter.com/podrobnosti_biz/status/628...,http://podrobnosti.biz/news/2597-spartak2-razg...,,False,0
196870,2753211010,PHOENIXDAILYNEW,Pearl Harbor Remembrance Day in Phoenix https:...,United States,English,12/8/2016 0:22,12/8/2016 0:22,5321,14864,20159,...,NewsFeed,0,2753211010,806655081290534913,http://twitter.com/2753211010/statuses/8066550...,https://twitter.com/PhoenixDailyNew/status/806...,http://www.azcentral.com/videos/news/local/pho...,,False,0
67599,2530603456,ALEKSEY_SOKOL_,First in line to the cake table at the #CFDA p...,Unknown,English,10/27/2016 4:09,10/27/2016 4:09,1702,338,2610,...,NonEnglish,0,2530603456,791491885680525312,http://twitter.com/Aleksey_Sokol_/statuses/791...,https://twitter.com/MariaSharapova/status/7914...,,,False,2
30330,508761973,NOVOSTISPB,"Церковь святого Александра Невского или, по-др...",United States,Russian,5/21/2016 20:42,5/21/2016 20:49,3312,105708,36932,...,NonEnglish,1,508761973,734122261586841600,http://twitter.com/NovostiSPb/statuses/7341222...,https://twitter.com/NovostiSPb/status/73412226...,,,False,0
69872,2532611755,KATHIEMRR,it's criminal #kind,Unknown,English,11/29/2014 8:13,11/29/2014 8:13,63,60,544,...,HashtagGamer,0,2532611755,538606630225465344,http://twitter.com/KathieMrr/statuses/53860663...,,,,False,1
87688,1715424829,KENZDONOVAN,�That's Dads for you! Right @RandPaul? https:/...,United States,English,8/1/2015 1:03,8/1/2015 1:03,452,283,1391,...,RightTroll,0,1715424829,627283577898532864,http://twitter.com/KenzDonovan/statuses/627283...,https://twitter.com/buzzfeedandrew/status/6272...,,,False,0
43099,2587843805,KANSASDAILYNEWS,Kansas man struck by lightning: ‘The worst pai...,United States,English,3/30/2017 4:31,3/30/2017 4:31,5614,25338,48370,...,NewsFeed,0,2587843805,847305187131654147,http://twitter.com/2587843805/statuses/8473051...,https://twitter.com/KansasDailyNews/status/847...,http://ksn.com/2017/03/29/kansas-man-struck-by...,,False,0
235001,2630842499,DAILYSANDIEGO,Owners Who Left Dog For Dead Plead Guilty to A...,United States,English,4/29/2017 7:26,4/29/2017 7:26,7460,16300,20531,...,NewsFeed,0,2630842499,858220858854211584,http://twitter.com/2630842499/statuses/8582208...,https://twitter.com/DailySanDiego/status/85822...,http://www.nbcsandiego.com/news/local/Owners-L...,,False,0
15891,1512371617,CATELINEWATKINS,"#ThereIsMoreThanOne valid opinion, not just yo...",United States,English,9/26/2016 18:47,9/26/2016 18:48,1570,1933,2273,...,HashtagGamer,0,1512371617,780478991681466368,http://twitter.com/CatelineWatkins/statuses/78...,,,,False,1


### Background Knowledge &mdash; Avoid Slow Pandas [<i class="fa fa-forward"></i>](#skip_slow)

Regardless of the type of transformation you are doing, there is one common pitfall you should avoid in pandas &mdash; looping over rows.

**<font color='crimson'>!! TO FINISH !!</font>**

<a id='skip_slow'></a>
# Aggregation and Grouped Operations

Some of the most common tasks in any given data analysis project involve some sort of aggregation or grouped operation. For example, we might want to compute and compare descriptive statistics for observations that take different values on a categorical variable. Let's see how to do that, and other grouped operations, with `pandas`. 

In brief, the `groupby()` method splits the `dataframe` into groups based on the values of a given variable. We can then perform operations on the resulting groups, such as computing descriptive statistics. 

In [28]:
grouped = df.groupby('account_category')
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

The code above returns a grouped object that we can work with. Let's say we want to pull out a specific group. We can use the `get_group()` method to pull a group from the grouped object. (Note that the `.get_group()` code below is equivalent to `df[df['account_category'] == 'RightTroll']`.) 

In [29]:
right_troll_group = grouped.get_group('RightTroll')
right_troll_group.head(10)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,...,account_type,retweet,account_category,new_june_2018,alt_external_id,tweet_id,article_url,tco1_step1,tco2_step1,tco3_step1
0,2535818742,HAPPKENDRAHAPPY,Bosh situation hasn't changed since Feb. Heat ...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2285,...,Right,1,RightTroll,0,2535818742,779366817752084481,http://twitter.com/happkendrahappy/statuses/77...,,,
1,2535818742,HAPPKENDRAHAPPY,Youre an IDIOT! Now you say 99% when before yo...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2284,...,Right,1,RightTroll,0,2535818742,779366807446642688,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/erecordscity/status/779354...,,
2,2535818742,HAPPKENDRAHAPPY,Charlotte-Mecklenburg Fraternal Order of Polic...,United States,English,9/23/2016 17:08,9/23/2016 17:08,1311,1688,2286,...,Right,1,RightTroll,0,2535818742,779366828585910272,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
3,2535818742,HAPPKENDRAHAPPY,Theodore Roosevelt's son Quentin and his frien...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2290,...,Right,1,RightTroll,0,2535818742,779367073998835712,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/HistoryInPics/status/73882...,,
4,2535818742,HAPPKENDRAHAPPY,.@flashfire451: Are there more cures than term...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2289,...,Right,1,RightTroll,0,2535818742,779367063479521281,http://twitter.com/happkendrahappy/statuses/77...,,,
5,2535818742,HAPPKENDRAHAPPY,Suspected Illegal Alien Marijuana Farmers Held...,United States,English,9/23/2016 17:09,9/23/2016 17:09,1311,1688,2288,...,Right,1,RightTroll,0,2535818742,779367053077712896,http://twitter.com/happkendrahappy/statuses/77...,http://www.breitbart.com,,
6,2535818742,HAPPKENDRAHAPPY,This picture is 100% BOGUS! Just watch MSNBC &...,United States,English,9/23/2016 17:10,9/23/2016 17:10,1311,1688,2291,...,Right,1,RightTroll,0,2535818742,779367286385831937,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/micheleredding2/status/779...,,
7,2535818742,HAPPKENDRAHAPPY,Proud to be part-Polish! Poland Initially Appr...,United States,English,9/23/2016 17:11,9/23/2016 17:11,1311,1688,2293,...,Right,1,RightTroll,0,2535818742,779367574454800384,http://twitter.com/happkendrahappy/statuses/77...,http://www.lifenews.com/2016/09/23/poland-pois...,,
8,2535818742,HAPPKENDRAHAPPY,in your case Hillary if you do when you're go...,United States,English,9/23/2016 17:11,9/23/2016 17:12,1311,1688,2292,...,Right,1,RightTroll,0,2535818742,779367562782117889,http://twitter.com/happkendrahappy/statuses/77...,https://twitter.com/cboutet11/status/778718668...,,
9,2535818742,HAPPKENDRAHAPPY,#ThingsMoreTrustedThanHillary Any fairy tale book,United States,English,9/27/2016 1:35,9/27/2016 1:37,1311,1686,2293,...,Right,0,RightTroll,0,2535818742,780581631342080001,http://twitter.com/happkendrahappy/statuses/78...,,,


As previously mentioned, sometimes we want to compute some value for a group within the dataset. We can do this by specifying the grouped object, the `Series` we want to perform an operation on, and finally the operation we want to perform. A full list of operations available when working with `Series` can be found in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

In [30]:
grouped['followers'].median()

account_category
Commercial        273
Fearmonger         48
HashtagGamer     2480
LeftTroll         836
NewsFeed        14722
NonEnglish        503
RightTroll       1437
Unknown           205
Name: followers, dtype: int64

In [31]:
grouped['following'].median()

account_category
Commercial         3
Fearmonger        65
HashtagGamer    2613
LeftTroll        796
NewsFeed        7089
NonEnglish       434
RightTroll      1864
Unknown          567
Name: following, dtype: int64

There are many things you can do here, such as comparing the ratio of followers to following. 

In [32]:
grouped['followers'].median() / grouped['following'].median()

account_category
Commercial      91.000000
Fearmonger       0.738462
HashtagGamer     0.949101
LeftTroll        1.050251
NewsFeed         2.076739
NonEnglish       1.158986
RightTroll       0.770923
Unknown          0.361552
dtype: float64

We can also perform some operations on the grouped object itself, such as computing the number of observations in each group, which in this case is equal to the number of tweets sent by accounts in each category. 

In [33]:
grouped.size().sort_values(ascending=False)

account_category
NonEnglish      820803
RightTroll      711668
NewsFeed        598226
LeftTroll       427141
HashtagGamer    241786
Commercial      121904
Unknown          13539
Fearmonger       11140
dtype: int64

It is also possible to group by multiple variables, such as `account_category` and `language`, and then perform an operation on the groups, such as computing the median number of followers. 

In [34]:
cat_lang = df.groupby(['account_category', 'language'], as_index=False)['followers'].median()
cat_lang.sample(30)

Unnamed: 0,account_category,language,followers
99,HashtagGamer,Hungarian,2745.0
237,NonEnglish,Indonesian,126.5
255,NonEnglish,Slovak,1220.0
187,NewsFeed,Farsi (Persian),13768.0
254,NonEnglish,Simplified Chinese,6865.0
226,NonEnglish,Estonian,160.0
246,NonEnglish,Malay,1523.5
172,LeftTroll,Turkish,769.5
245,NonEnglish,Macedonian,406.0
39,Commercial,Serbian,257.0


Depending on what you are doing, the result of a grouped analysis like this could be a `Series` or a `DataFrame`. 
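
A minimal sketch of that difference, on hypothetical toy data: by default the group labels become the index and a single-column aggregation returns a `Series`, while `as_index=False` keeps the labels as ordinary columns and returns a `DataFrame`.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the tweets dataset
toy = pd.DataFrame({
    'account_category': ['RightTroll', 'RightTroll', 'LeftTroll', 'LeftTroll'],
    'language': ['English', 'English', 'English', 'Russian'],
    'followers': [100, 300, 50, 70],
})

# Default: the group labels become the index and the result is a Series
as_series = toy.groupby('account_category')['followers'].median()

# as_index=False keeps the group labels as ordinary columns, giving a DataFrame
as_frame = toy.groupby(['account_category', 'language'], as_index=False)['followers'].median()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```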

Finally, we can perform *multiple* operations on a grouped object by using the `agg()` method ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)). The `agg()` method will apply one or more aggregate functions to a grouped object, returning the results of each. 

To specify which operations `agg()` will apply, a list of functions or string function names is provided. Each function must accept a `Series` or `DataFrame` as input (depending on what type the grouped object is) or work when passed to the `apply()` method. 

We can re-implement the median calculation from above using the `agg()` method.

In [44]:
grouped['followers'].agg([np.median])

Unnamed: 0_level_0,median
account_category,Unnamed: 1_level_1
Commercial,273
Fearmonger,48
HashtagGamer,2480
LeftTroll,836
NewsFeed,14722
NonEnglish,503
RightTroll,1437
Unknown,205


We can specify additional functions by adding to the list provided to `agg()`. Notice that we can use a function or a string function name when specifying which operations to apply.

In [46]:
grouped['followers'].agg([min, np.median, 'max', 'count'])

Unnamed: 0_level_0,min,median,max,count
account_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Commercial,0,273,858,121904
Fearmonger,0,48,120,11140
HashtagGamer,0,2480,24663,241786
LeftTroll,0,836,56725,427141
NewsFeed,0,14722,62088,598226
NonEnglish,-1,503,251276,820803
RightTroll,0,1437,145244,711668
Unknown,0,205,6343,13539


We can even define our own function for `agg()` to use.

In [47]:
def count_greater_than_0(series):
    gt_0 = series[series > 0]
    return len(gt_0)

grouped['followers'].agg([min, np.median, 'max', 'count', count_greater_than_0])

Unnamed: 0_level_0,min,median,max,count,count_greater_than_0
account_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Commercial,0,273,858,121904,121704
Fearmonger,0,48,120,11140,10898
HashtagGamer,0,2480,24663,241786,240713
LeftTroll,0,836,56725,427141,426891
NewsFeed,0,14722,62088,598226,597888
NonEnglish,-1,503,251276,820803,817637
RightTroll,0,1437,145244,711668,703651
Unknown,0,205,6343,13539,12997
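
Pandas also accepts keyword arguments in `agg()` on a grouped `DataFrame`, often called named aggregation, which sets the output column names directly. The sketch below is one way this could look on hypothetical toy data; the output names (`followers_min` and so on) are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the tweets dataset
toy = pd.DataFrame({
    'account_category': ['RightTroll', 'RightTroll', 'LeftTroll', 'LeftTroll'],
    'followers': [0, 300, 50, 70],
})

def count_greater_than_0(series):
    # Number of values in the group that are strictly positive
    return int((series > 0).sum())

# Each keyword is an output column name; each value is (input column, function)
summary = toy.groupby('account_category').agg(
    followers_min=('followers', 'min'),
    followers_median=('followers', 'median'),
    followers_gt_0=('followers', count_greater_than_0),
)

print(summary)
```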


### Background Knowledge &mdash; Axes [<i class="fa fa-forward"></i>](#skip_axes)

Once again, feel free to skip this section. 

If you took a chance to delve into the documentation for any of the aggregation functions, you may have noticed an optional parameter called `axis`. The description of this parameter usually says something like: 

> **axis : {0 or ‘index’, 1 or ‘columns’}, default 0**   
If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.

In other words, `axis=0` is going to operate over the columns of the dataframe and `axis=1` will operate over the rows in the dataframe. 

If we are doing tasks related to columns, such as calculating the median value of a column or sorting by a column, we will want to set `axis=0`. 

If we are doing tasks related to rows, such as using the `apply()` method to create a new column from the values in each row, we will want to set `axis=1`. One case to watch for: `dropna()` follows a different convention, where `axis=0` (the default) drops rows containing missing values and `axis=1` drops columns. 

As a bit of foreshadowing, operating over rows is generally very slow. We will return to this in a later Background Knowledge section.
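
To make the two directions concrete, here is a minimal sketch on hypothetical toy data. It computes a sum with each `axis` value and then uses `apply()` with `axis=1` to build a value from each row.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({
    'followers': [10, 20, 30],
    'following': [1, 2, 3],
})

# axis=0 (the default): one result per column, computed down the rows
per_column = toy.sum(axis=0)

# axis=1: one result per row, computed across the columns
per_row = toy.sum(axis=1)

# apply() follows the same convention: axis=1 passes each row to the function
ratio = toy.apply(lambda row: row['followers'] / row['following'], axis=1)

print(per_column)
print(per_row)
```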

<a id='skip_axes'></a>
## Sorting and Ranking

Sorting and ranking observations based on some criteria is a common data analysis task. For example, we might want to know which accounts in our dataset have the most followers. 

First, I will create a new `DataFrame` with some extra follower information for each account category & language group. 

In [70]:
# Group without as_index=False so that reset_index() turns the group labels into columns
new_cat_lang = df.groupby(['account_category', 'language'])['followers'].agg(['min', 'median', 'max']).reset_index()
new_cat_lang.columns = ['account_category', 'language', 'followers_min', 'followers_median', 'followers_max']
new_cat_lang.sample(10)

Unnamed: 0,account_category,language,followers_min,followers_median,followers_max
16,Commercial,Gujarati,191,191.0,191
51,Commercial,Turkish,99,381.0,853
122,HashtagGamer,Tagalog (Filipino),9,2689.0,22638
191,NewsFeed,Hungarian,2,13156.0,16724
152,LeftTroll,Japanese,20,837.5,2427
324,Unknown,Farsi (Persian),0,61.0,92
259,NonEnglish,Swedish,0,697.5,6518
278,RightTroll,Finnish,2,119.0,32819
123,HashtagGamer,Thai,2147,2522.5,3751
17,Commercial,Hebrew,255,299.0,372


To start, we can sort `new_cat_lang` based on the median number of followers. 

In [71]:
new_cat_lang.sort_values('followers_median', ascending=False)[:10]

Unnamed: 0,account_category,language,followers_min,followers_median,followers_max
242,NonEnglish,LANGUAGE UNDEFINED,0,26395.0,251275
178,NewsFeed,Arabic,0,20700.0,33185
213,NewsFeed,Turkish,18603,19685.0,20994
209,NewsFeed,Somali,10,19622.0,31854
188,NewsFeed,Finnish,12288,19035.5,25733
292,RightTroll,LANGUAGE UNDEFINED,0,17448.0,32459
215,NewsFeed,Uzbek,6,17026.0,27637
216,NewsFeed,Vietnamese,1136,16555.0,61661
177,NewsFeed,Albanian,7,16110.0,31813
183,NewsFeed,Danish,12140,15565.0,18792


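Using the same criteria, we can also rank the account groups by their median number of followers. Note that `rank()` ranks in ascending order by default, so the group with the lowest median receives the lowest rank; passing `ascending=False` would instead give a rank of 1 to the group with the highest median. With `method='max'`, tied groups all receive the highest rank in their tie. 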

In [74]:
new_cat_lang['followers_median'].rank(method='max')

0      130.0
1      140.0
2      123.0
3      155.0
4      151.0
       ...  
347     66.0
348     16.0
349     81.0
350     59.0
351     35.0
Name: followers_median, Length: 352, dtype: float64

If we were to save this value as a new column, we could use it later on for filtering, conducting more analyses, or highlighting important accounts in a visualization. More on this later.   
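
As a minimal sketch of that idea, the example below ranks hypothetical toy data in both directions, saves the ranks as new columns, and then filters on one of them. The column names `rank_ascending` and `rank_descending` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with one tie (groups b and d)
toy = pd.DataFrame({
    'group': ['a', 'b', 'c', 'd'],
    'followers_median': [10.0, 500.0, 40.0, 500.0],
})

# Default is ascending: the smallest value receives rank 1
toy['rank_ascending'] = toy['followers_median'].rank(method='max')

# ascending=False ranks from the largest value down
toy['rank_descending'] = toy['followers_median'].rank(method='max', ascending=False)

# The saved rank can then be used as a filter, e.g. keep ranks 1 and 2
top = toy[toy['rank_descending'] <= 2]

print(toy)
```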

### Breaking Ties
Consider the sort below. 

In [75]:
new_cat_lang.sort_values('followers_median', ascending=True)[:10]

Unnamed: 0,account_category,language,followers_min,followers_median,followers_max
155,LeftTroll,LANGUAGE UNDEFINED,0,0.0,622
345,Unknown,Swedish,0,0.0,0
282,RightTroll,Gujarati,0,0.0,11
71,Fearmonger,LANGUAGE UNDEFINED,0,0.0,0
268,RightTroll,Bengali,1,1.0,1
269,RightTroll,Bulgarian,1,1.0,1
327,Unknown,German,0,7.0,220
214,NewsFeed,Urdu,10,10.0,10
340,Unknown,Russian,0,12.0,6343
249,NonEnglish,Portuguese,0,18.0,4348


As you can see, there are multiple account groups with a median of 0 followers. In these cases it might be useful to break the tie using another column. 

To do this, we can specify multiple columns to be used during sorting. The importance of the columns in the sort is determined by the order in which they are provided. For example, in the cell below the `followers_median` column will be used to sort the data first, then the `followers_max` column will only be used to break ties in the original sort.  

In [76]:
new_cat_lang.sort_values(['followers_median', 'followers_max'], 
                         ascending=True)[:10]

Unnamed: 0,account_category,language,followers_min,followers_median,followers_max
71,Fearmonger,LANGUAGE UNDEFINED,0,0.0,0
345,Unknown,Swedish,0,0.0,0
282,RightTroll,Gujarati,0,0.0,11
155,LeftTroll,LANGUAGE UNDEFINED,0,0.0,622
268,RightTroll,Bengali,1,1.0,1
269,RightTroll,Bulgarian,1,1.0,1
327,Unknown,German,0,7.0,220
214,NewsFeed,Urdu,10,10.0,10
340,Unknown,Russian,0,12.0,6343
249,NonEnglish,Portuguese,0,18.0,4348


<font color='crimson'>I couldn't find a method for breaking ties in rank... Does one exist?</font>
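
On the question of ties in rank: the `method` argument of `rank()` controls how tied values are handled, and `method='first'` breaks ties by order of appearance. The sketch below compares the available methods on hypothetical toy values, then shows one possible way to break ties with a second column by sorting on both columns before ranking with `method='first'`.

```python
import pandas as pd

# Hypothetical medians with one tie (w and x)
medians = pd.Series([0.0, 0.0, 1.0, 7.0], index=['w', 'x', 'y', 'z'])

ties = pd.DataFrame({
    'average': medians.rank(method='average'),  # tied values share the mean rank
    'min': medians.rank(method='min'),          # tied values share the lowest rank
    'max': medians.rank(method='max'),          # tied values share the highest rank
    'first': medians.rank(method='first'),      # ties broken by order of appearance
    'dense': medians.rank(method='dense'),      # like 'min', but ranks have no gaps
})
print(ties)

# Breaking ties with a second column: sort on both, then rank in that order
toy = pd.DataFrame({
    'followers_median': [0.0, 0.0, 1.0],
    'followers_max': [622, 0, 1],
})
ordered = toy.sort_values(['followers_median', 'followers_max'])
ordered['rank'] = ordered['followers_median'].rank(method='first')
print(ordered)
```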

## Correlation and Covariance 

* Correlation matrix 

In [77]:
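# Restrict to numeric columns; recent pandas versions raise an error on text columns
df.select_dtypes('number').corr()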

Unnamed: 0,following,followers,updates,retweet,new_june_2018,tweet_id
following,1.0,0.580259,0.15195,-0.305094,-0.150726,0.110589
followers,0.580259,1.0,0.233705,-0.312036,-0.049159,0.086571
updates,0.15195,0.233705,1.0,-0.17192,0.119216,0.14943
retweet,-0.305094,-0.312036,-0.17192,1.0,0.116437,-0.027388
new_june_2018,-0.150726,-0.049159,0.119216,0.116437,1.0,-0.353891
tweet_id,0.110589,0.086571,0.14943,-0.027388,-0.353891,1.0


In [78]:
df['followers'].corr(df['following'])

0.5802587806116586

* covariance 
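
Covariance works the same way as correlation. A minimal sketch on hypothetical toy data (the values are invented so that the result is easy to verify by hand): `cov()` on a `DataFrame` returns the covariance matrix, and `cov()` on a `Series` takes a second `Series` and returns a single number.

```python
import pandas as pd

# Hypothetical toy data in which following is exactly twice followers
toy = pd.DataFrame({
    'followers': [1.0, 2.0, 3.0, 4.0],
    'following': [2.0, 4.0, 6.0, 8.0],
})

# Covariance matrix for all columns (sample covariance, dividing by n - 1)
cov_matrix = toy.cov()

# Covariance and correlation between one pair of columns
cov_pair = toy['followers'].cov(toy['following'])
corr_pair = toy['followers'].corr(toy['following'])

print(cov_matrix)
print(cov_pair, corr_pair)
```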


# Dates and Times

> **Jillian** -- would be great to introduce this here if you have the bandwidth. Or I can later. I have started some time series plots in the next notebook and will be adding more tomorrow. If not, no big deal. 

JA: I will do this. Likely just be super basic. The weird things people need to know for working with pandas datetime objects. Sorting, transforming, extracting parts, converting to strings, etc. Anything else? 

## Selecting, Aggregating, Summarizing, and Subsetting <a id='sass'></a>

In `Pandas`, each individual column / variable is called a `Series`. To select a `Series` from our `dataframe`, we simply have to type the `Series` name, in quotes, inside square brackets following the name of the `dataframe` object itself. For example, to get the `Series` containing account types, we would type:

In [None]:
df['account_type']

We can perform simple operations on these individual `Series`, such as counting the number of observations per account type in the dataset. 

In [None]:
obs_by_type = df['account_type'].value_counts()
obs_by_type.sort_values(ascending = False)

We can create smaller `dataframes` by passing in a list of the `Series` we want to include, or by subsetting the observations based on some criteria. In the first case, we produce a smaller `dataframe` by selecting specific variables. In the second case we produce a smaller `dataframe` by selecting specific observations. For example, if we wanted to select the variables for following, followers, and account type, we could do the following: 

In [None]:
small = df[['following', 'followers', 'account_type']]
small.sort_values(['following'], ascending = False).head(40)

The second way we might want to subset our data is by selecting some subset of observations. For example, if we wanted to pull out a subset of our data for right troll accounts, we could do the following:

In [None]:
right_trolls = df[df['account_type'] == 'Right']
right_trolls.sample(10)

`Pandas` makes it easy to perform fundamental data operations, such as grouping by one variable and computing a summary statistic for another. Let's say, for example, that we wanted to group our dataset by account type and then get the median number of followers. (Note that I am not suggesting you should do this in a real data analysis, I am simply demonstrating `Pandas` functionality.) 

In [None]:
median_followers = df.groupby('account_type')['followers'].median().round(0)
median_followers.sort_values(ascending = False)[:35]

# Best practices for Pandas <a id='pandasbp'></a>

From [pandas](https://pandas.pydata.org/):
>pandas is a **fast**, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 

This is all true, with a pretty large caveat. Pandas is fast (and generally efficient) if you avoid some of the common pitfalls. Unfortunately, these traps are easy to fall into, and many pandas users (even senior data scientists) don't know they might be slowing their code down 10-1000x. These people will often be hesitant to use pandas on large datasets and may dissuade others from using the library. 

However, by understanding a little about what is going on in the backend, we can avoid the worst of the problems and write relatively fast pandas code. 

[How is data stored in pandas?](#storage)   
[Efficient Transformation](#transform)   
[Efficient Initialization](#init)   

## How is data stored in Pandas? <a id='storage'></a>

<h3><font color='tomato'>Series</font></h3>

### DataFrames
DataFrames are really just a collection of Series, with each column corresponding to its own Series. As we mentioned above, each item in a Series (or column) is stored right after the one before it. This means that the entire column is stored within a single range of memory.

However, the multiple Series (columns) that make a DataFrame can be stored anywhere in memory and are often not stored side-by-side. 

We can think of this like a grocery list for sandwiches. Let's imagine that each kind of sandwich we make is composed of 1 type of bread, 1 type of meat, and 1 type of vegetable. We could arrange our grocery list into a table like this: 

| sandwich_id | bread_type | meat_type  | vegetable_type |
|-------------|------------|------------|----------------|
| 0           | sourdough  | ham        | lettuce        |
| 1           | baguette   | turkey     | tomato         |
| 2           | rye        | roast beef | onion          |

We buy all of our bread products from a bakery, meat from a deli, and vegetables from a grocer. The result is that to get everything in a column, you can go to one location (e.g. bakery for bread_type). But to get everything from a row you will have to visit all three locations. 

This means that it is really fast to access an entire column, but really slow to access a row. Let's check it out!

>**Aside**   
>In the code below, the `%timeit` line is called a [magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#). The `timeit` magic lets us time the execution of a Python statement. 

In [None]:
# Setup code -- ideally this is changed to a dataset we are using 
import numpy as np
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.sample(5)

In [None]:
print("Column\n------")
%timeit sl = iris['petal_length']

print("Row\n------")
%timeit example_1 = iris.iloc[12]

That is a difference in speed of more than 50x! 
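
One way to see the extra work a row access involves is to compare what comes back. The sketch below uses a made-up toy table (the `flowers` name, the `n_petals` column, and all values are invented for illustration). A column is returned as a `Series` with a single data type, while a row has to be assembled from every column into a brand new `Series`:

```python
import pandas as pd

# Hypothetical toy table with columns of different types
flowers = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],                    # floats
    'n_petals': [5, 5, 6],                              # integers
    'species': ['setosa', 'versicolor', 'virginica'],   # strings
})

# Each column keeps its own data type
print(flowers.dtypes)

# A column comes back as a Series with one data type
column = flowers['petal_length']
print(column.dtype)    # float64

# A row is built from one value per column. Because those values have
# different types, the new Series falls back to the generic 'object' type
row = flowers.iloc[1]
print(row.dtype)       # object
```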

## Transformations <a id='transform'></a>
One instance where the underlying storage structure has consequences for speed is when applying transformations (calculations or other functions) to a DataFrame. 

A common case of this is when we want to add a new column to our DataFrame based on values in other columns. For example, we may want to:  
* Extract month from a date column
* Calculate area from width & length columns
* Predict whether a flight will be late by applying a deep learning model to the values of 5 other columns. 

There is a long list of transformations we might be interested in, many of which operate on a single row, independent of other rows. 

There are many ways to implement transformations in pandas, some of which take advantage of how DataFrames are stored and others that do not. Below, we are going to look at 6 methods for implementing transformations:
1. [For Loops](#for)
2. [`iterrows`](#iter)
3. [Apply Method](#apply)
4. [Zip & Iterate](#zip)
5. [Vectorized Functions](#vec) 
6. [NumPy Vectorized Functions](#np)

We will use a common example across transformation methods allowing us to compare the speed of each one. For each method we will create a new column called `petal_area` by multiplying `petal_length` by `petal_width`.

### 1 For Loops <a id='for'></a>
Perhaps the most obvious way to approach a transformation is to go row-by-row through the DataFrame using a [for loop](01_introduction.ipynb#conditional), doing the necessary calculation one row at a time.

For each row in the DataFrame we will calculate the area by multiplying `petal_length` by `petal_width` and append the result to a list that could later be added to the DataFrame as a new column.

In [None]:
%%timeit
# Looping over the rows
area_column = []
for i in range(0, len(iris)):
    row = iris.loc[i]
    row_area = row['petal_length'] * row['petal_width']
    area_column.append(row_area)

While we haven't checked any other methods yet, I'll let you know that this is _really_ slow. If we think about how DataFrames are stored it becomes clear why this is so slow. 

Before I get into this, let's look at the second method. 



### 2  `iterrows()` <a id='iter'></a>
A second method we can use to add our new column is the `iterrows()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows)). This is a built-in method pandas has implemented to iterate over the rows in a DataFrame. 

This method creates a [generator object](https://wiki.python.org/moin/Generators), a special Python object that we can iterate over with a for loop. 

In [None]:
%%timeit
area_column = []
for idx, row in iris.iterrows():
    row_area = row['petal_length'] * row['petal_width']
    area_column.append(row_area)

While this was definitely faster than the basic [for loop](#for) approach, it's still really slow. In fact, the underlying reason why these two approaches are so slow is the same.

Both approaches use a `for` loop to go row-by-row through the DataFrame, gathering the data for each row as it's needed. 

In our sandwich example, this is the equivalent of buying ingredients for sandwich 1, then buying ingredients for sandwich 2, etc. This results in visiting each shop (bakery, deli, grocer) once for every sandwich recipe!

The same thing is happening in pandas. To iterate over the rows using a for loop we retrieve all values for row 1, then all values for row 2, etc. 

This is incredibly inefficient (imagine the funny looks you'd get on your 3rd visit to the bakery)! In fact, I would venture to say that **you should never use for loops when working with pandas DataFrames**. There might be cases when I'm wrong, but there is almost always a better approach than `for` loops. 
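
To make the cost a little more concrete, here is a minimal sketch of what `iterrows()` hands back on each step, using a made-up two-row table (the `flowers` name and its values are invented for illustration). Every step yields an `(index, row)` pair, and the row is a new `Series` that pandas has to build from one value per column:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
flowers = pd.DataFrame({
    'petal_length': [1.4, 4.7],
    'petal_width': [0.2, 1.4],
})

rows = flowers.iterrows()    # a generator; no rows have been built yet
idx, row = next(rows)        # ask for the first (index, row) pair

print(idx)                   # 0
print(type(row).__name__)    # Series
print(row.index.tolist())    # ['petal_length', 'petal_width']
```

Building one of these `Series` is cheap on its own, but a DataFrame with a million rows means a million of them.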

### 3 Apply Method <a id=apply></a>
A third approach we can use is the `apply()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply)). This built-in pandas method applies a specific function across some axis (rows or columns). In our case, we want to apply a function along the column axis, applying the function to each row. 

To use `apply()`, you have to define the function you want to apply. This function needs to take in a row, perform the calculation, and return some value. For our case, we'll define an `area()` function. 

In [None]:
%%timeit
def area(row):
    return row['petal_length'] * row['petal_width']

area_column = iris.apply(area, axis=1)

Thankfully this is faster than the previous two approaches, but we are still in the realm of milliseconds. This method is still relatively slow because it continues to go row-by-row through the DataFrame. 

It's faster because `apply()` is used for a specific purpose, so pandas is able to make assumptions and include optimizations that the more general approaches don't have access to. For example, `apply()` checks to see if your function is compatible with its "fast" mode ([docs](https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/frame.py#L6737-L6928)). As well, it offloads some of the work to C (a low-level language known for speed), only running your function itself in Python. 

I almost always avoid using `apply()`, although it does make for readable code.
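
If you do reach for `apply()`, it helps to be clear about what the `axis` argument controls. The sketch below uses a made-up toy table (the `flowers` name and its values are invented for illustration) and shows the function receiving one row at a time with `axis=1`, and one whole column at a time with the default `axis=0`:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
flowers = pd.DataFrame({
    'petal_length': [1.0, 4.0, 6.0],
    'petal_width': [0.5, 1.5, 2.0],
})

def area(row):
    return row['petal_length'] * row['petal_width']

# axis=1: the function is called once per row
areas = flowers.apply(area, axis=1)
print(areas.tolist())           # [0.5, 6.0, 12.0]

# axis=0 (the default): the function is called once per column
column_max = flowers.apply(lambda col: col.max())
print(column_max.tolist())      # [6.0, 2.0]
```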

### 4 Zip & Iterate <a id=zip></a>
A fourth method is to use Python's built-in `zip()` function ([docs](https://docs.python.org/3.3/library/functions.html#zip)). This function takes in a group of iterables (lists, dictionaries, tuples, etc.) and creates a new iterator where the i-th element is a tuple containing the i-th elements from each of the original iterables. 

For example:   
```
>>> l1 = [1, 2, 3]
>>> l2 = ['a', 'b', 'c']
>>> z = zip(l1, l2)
>>> list(z)
    [(1, 'a'), (2, 'b'), (3, 'c')]
```

In general this is quite a useful function that can be used for lots of different purposes. In our case we will:
1. Determine which columns are needed for the transformation
2. Zip these columns together
3. Iterate over the zipped object to retrieve pairs one at a time, applying some function to the pairs and storing the result in a list which will later become our new column

In [None]:
%%timeit
area_column = []
for length, width in zip(iris['petal_length'], iris['petal_width']):
    area = length * width
    area_column.append(area)

As you can see, this method offers a great improvement over our last method (~120x faster). The reason for this large improvement is that this is the first method that avoids going row-by-row through the DataFrame. 

Instead of performing many (#rows x #columns) costly read operations, this method reads each column only once. The resulting data is stored temporarily in fast memory, where it can be accessed at little cost when it is needed for calculations. 

This is the method I typically use for calculations. It offers a good balance of efficiency, readability, and flexibility. 
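
The flexibility shows up when the calculation mixes column types or involves ordinary Python logic. Here is a minimal sketch using a list comprehension over zipped columns. The `flowers` table, the `label()` function, and the 4.0 cut-off are all invented for illustration:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
flowers = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'species': ['setosa', 'versicolor', 'virginica'],
})

def label(species, length):
    """Hypothetical labelling rule: any plain Python logic can go here."""
    size = 'long' if length > 4.0 else 'short'
    return f'{species} ({size})'

# Read each column once, pair the values up, and build the new column
flowers['label'] = [
    label(species, length)
    for species, length in zip(flowers['species'], flowers['petal_length'])
]

print(flowers['label'].tolist())
# ['setosa (short)', 'versicolor (long)', 'virginica (long)']
```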

### 5 Use Vectorized Functions <a id=vec></a>
Depending on the transformation we are undertaking, we might be able to use a vectorized function. Vectorized functions (aka vector functions) take in and operate on entire pandas Series, rather than on individual values. 

There are many built-in vectorized functions, such as `*` (shown below), `add()`, `between()`, and `shift()`. You can also build your own vectorized function as a combination of these built-in methods.  

In [None]:
%%timeit
# Straight up vector calculations
iris['petal_area'] = iris['petal_length'] * iris['petal_width']

Similar to option 4, this method is significantly faster than the first three approaches. Once again, this is because we are avoiding accessing rows one-by-one. 

In addition, vectorized functions are able to further optimize by making use of pre-compiled code written in a lower-level (and faster) language like C. 

Honestly, I'm unsure why this method appears to be slower than our 4th option, the zip method. I suspect this will not always be the case, especially when functions become more complex. 
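
For a feel of the built-in vectorized methods named above, here is a minimal sketch on a made-up Series (the `lengths` values and the `relative_change()` helper are invented for illustration). Each call works on the whole Series at once, and the helper shows how a custom vectorized function can be built by combining them:

```python
import pandas as pd

# Hypothetical values, invented for illustration only
lengths = pd.Series([10, 40, 60, 30])

print(lengths.add(5).tolist())            # [15, 45, 65, 35]
print(lengths.between(40, 50).tolist())   # [False, True, False, False]
print(lengths.shift(1).tolist())          # [nan, 10.0, 40.0, 60.0]

def relative_change(series):
    """Hypothetical helper: change from the previous value, as a proportion."""
    previous = series.shift(1)
    return (series - previous) / previous

print(relative_change(lengths).tolist())  # [nan, 3.0, 0.5, -0.5]
```

The first value of the shifted Series is missing (`nan`) because there is no previous row to pull from.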

### 6 Use NumPy Vectorized Functions<a id=np></a>
For an improvement over method 5, we take one extra step and convert our pandas Series into NumPy arrays and apply the same vectorized functions to obtain our transformation. 

In [None]:
%%timeit
areas = np.array(iris['petal_length']) * np.array(iris['petal_width'])

As with the last two options, this approach avoids costly row-by-row reads. As with option 5, this method also uses pre-compiled code to achieve further optimization. 

In addition, by converting the pandas Series to NumPy arrays, this method removes the overhead incurred by pandas' additional functionality. 
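
Another way to do the conversion is the `to_numpy()` method that every Series has. This is a minimal sketch on a made-up toy table (the `flowers` name and its values are invented for illustration), and it also shows that the resulting array can be assigned straight back as a new column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
flowers = pd.DataFrame({
    'petal_length': [1.0, 4.0, 6.0],
    'petal_width': [0.5, 1.5, 2.0],
})

# Pull the raw NumPy arrays out of the two columns
length = flowers['petal_length'].to_numpy()
width = flowers['petal_width'].to_numpy()

# Element-by-element multiplication on plain arrays
areas = length * width
print(isinstance(areas, np.ndarray))   # True
print(areas.tolist())                  # [0.5, 6.0, 12.0]

# The array can be stored as a regular DataFrame column
flowers['petal_area'] = areas
```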

### and Beyond <a id='beyond'></a>
For the cases when even these options aren't fast enough, you can implement more advanced techniques to enhance performance. The improvements these advanced techniques can offer differ based on the problem at hand. For example, some techniques use functions and methods that are optimized for boolean comparisons (e.g. greater than) but offer little improvement when working with other functions like addition. 

Some other approaches to check out include: 
* Using [NumExpr](https://pypi.org/project/numexpr/2.6.1/) for extra fast numerical expressions
* Rewriting functions in [Cython](https://cython.org/)
* Using [Numba](https://numba.pydata.org/) to convert Python code to fast machine code. 

## Takeaways

While this difference in speed is hard (if not impossible) to notice for small datasets, it can become hugely consequential when working with large datasets or performing complex calculations. 
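
If you want to check this for yourself, one option is to rerun the comparison on a bigger table. The sketch below is only a rough benchmark under stated assumptions: the `big` table is 100,000 rows of randomly generated numbers (not real iris data), the function names are invented for illustration, and it uses Python's `timeit` module instead of the `%timeit` magic so that it also runs outside a notebook. With only 150 rows, fixed per-call overhead may dominate the timings, so the ranking you get here may differ from what we saw above. Your numbers will depend on your machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: random numbers, not real measurements
rng = np.random.default_rng(0)
n_rows = 100_000
big = pd.DataFrame({
    'petal_length': rng.uniform(1, 7, n_rows),
    'petal_width': rng.uniform(0.1, 2.5, n_rows),
})

def with_zip():
    return [length * width
            for length, width in zip(big['petal_length'], big['petal_width'])]

def with_pandas():
    return big['petal_length'] * big['petal_width']

def with_numpy():
    return big['petal_length'].to_numpy() * big['petal_width'].to_numpy()

# Time each approach 5 times and report the average per run
for func in (with_zip, with_pandas, with_numpy):
    seconds = timeit.timeit(func, number=5) / 5
    print(f'{func.__name__:<12} {seconds * 1000:.2f} ms per run')
```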

We have to remember that optimizing code should not come at the expense of functionality. Often it's best to get something that works before going back and finding the most optimal solution. However, I hope that by introducing a couple of "Do's & Don'ts" your first instincts can help you avoid some of the easiest traps.


1. Never directly iterate over the rows in a DataFrame. Avoid anything that goes row-by-row.  
2. Working with NumPy arrays will be faster than working with pandas Series.
3. DataFrame data is stored based on columns, not rows. This means it's much faster to access a column than a row. 


# Open Work Time <a id='open'></a>