# Pandas
Pandas is an essential Python package for storing and manipulating datasets. It is fast and extremely powerful. There are too many pandas functions to describe them all in this course, so you will learn the general capabilities of the package and will then be able to find further information by yourself. The official tutorials and documentation are great sources for learning about pandas.
* http://pandas.pydata.org/pandas-docs/stable/tutorials.html
* http://pandas.pydata.org/pandas-docs/stable/index.html

In this notebook you will learn the basics of pandas. More pandas capabilities will be shown in further notebooks on sample datasets. In this introductory lesson you will use simple datasets available in the statsmodels package (if the next cell does not execute properly, you should install the missing package: statsmodels).

In [None]:
import numpy as np
import pandas as pd


In [None]:
# http://statsmodels.sourceforge.net/devel/datasets/generated/fair.html
import statsmodels.api as sm
data = sm.datasets.fair.load_pandas()
marr = data.exog

In [None]:
marr = pd.read_pickle("marr.p")

## DataFrame and Series
The basic data type is the DataFrame, which consists of Series. Using a statsmodels function we have loaded a set of explanatory variables named marr. You can see what the dataset looks like by using the head method, which takes the number of rows to show as an argument.

In [None]:
print(marr.head())
marr.head(10)

As you can see, the output of the print function does not look good in a notebook. If the notebook uses its default display function (as it does for the last line of a cell), the table is formatted in a visually appealing way. Sometimes you may want to display the head/tail of a dataframe before the last line of a cell. You may use the notebook's display instead of print for this purpose.

In [None]:
from IPython.display import display
display(marr.tail(4))
display(marr.head(4))

Every DataFrame (df) has column names and indices (row names). If an index is generated automatically, it takes consecutive integer values, starting from 0. An index can take any form, including strings. In most cases this is not useful and you should keep integers as indices.
* http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html
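
As a minimal sketch of a non-integer index, the cell below uses a made-up `pets` frame (its name and columns are assumptions for illustration, not part of the marr dataset). `set_index` turns a column into row labels and `reset_index` restores the default integers.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate indices.
pets = pd.DataFrame({"name": ["rex", "tom", "kiwi"], "legs": [4, 4, 2]})

# Without an explicit index, rows are labelled 0, 1, 2, ...
default_labels = list(pets.index)

# Use the "name" column as the row index (string labels).
by_name = pets.set_index("name")
legs_of_kiwi = by_name.loc["kiwi", "legs"]

# reset_index moves the labels back into a column and restores integer labels.
restored = by_name.reset_index()
print(default_labels, legs_of_kiwi, list(restored.columns))
```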

By using column names you can easily create new views of an existing DataFrame or create copies of its parts. Unfortunately it is not always clear whether a new variable will be a copy or a reference. Usually, though, it will be a copy.

In [None]:
# Create a reference
c1marr = marr[:]
# Create a copy
c2marr = marr.copy()
print(c1marr._is_view, c2marr._is_view)

# Both lines create a copy, even though it is not explicit in the first case.
c4marr = marr[['age', 'children', 'rate_marriage']]
c5marr = marr[['age', 'children', 'educ']].copy()
print(c4marr._is_view, c5marr._is_view)
display(c4marr.head(3))
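
When in doubt, an explicit `.copy()` removes the ambiguity. The sketch below uses a small hypothetical frame `df` standing in for marr and only shows that changes to an explicit copy never reach the original.

```python
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({"age": [22.0, 27.0, 32.0], "educ": [12.0, 14.0, 16.0]})

# An explicit copy owns its data.
subset = df[["age"]].copy()
subset.loc[0, "age"] = 99.0

original_age = df.loc[0, "age"]     # unchanged in the original frame
copied_age = subset.loc[0, "age"]   # changed only in the copy
print(original_age, copied_age)
```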


You can easily show and change column names.

In [None]:
print("Print an object containing columns: \n", marr.columns)

marr.columns = ['rate', 'age', 'yrs_married', 'children', 'religious', 'educ',
       'occupation', 'occupation_husb']
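print("\nPrint values of a columns object after the change: \n", marr.columns.values)

# Rename a single column; rename is safer than modifying marr.columns.values in place.
marr = marr.rename(columns={"yrs_married": "years"})
print("\nPrint values of a columns object after changing one of them: \n", marr.columns.values)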

The .values attribute returns the contents of a given index or series as a numpy array. It is convenient if you need data in a simple format (usually for numpy). You can access a series using the dot operator or the column name in square brackets.

In [None]:
print(marr["age"].head())
print(marr.age.head())
print(marr.age.values[0:5])
print(type(marr.age.values))

Usually exploratory data analysis is the first step in data analysis. Obviously you may want to draw a histogram (charts will be shown later in the course), but you can also print numerical descriptions of data. The common descriptive functions known from numpy (mean, std, min, max and so on) are also available in pandas.
* List of descriptive functions: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats

In [None]:
print("Basic descriptions: \n", marr.age.describe())
print("\nNumber of levels: \n", marr.age.nunique())
print("\nCounts of levels: \n", marr.age.value_counts())
print("\nSome other descriptive measurement (mode): \n", marr.age.mode())



### Modifying contents
The contents of a series or df can be modified in multiple ways. Let's begin by creating a new column: age squared. The final result of all the ways presented below is the same. Note that you have to use square brackets with the column name when creating a new column; assignment with the dot operator (marr.age2 = ...) does not create a column.

In [1]:
marr["age2"] = marr["age"]*marr["age"]
marr["age2"] = marr.age*marr.age
marr["age2"] = marr["age"]**2
marr["age2"] = marr["age"].apply(lambda x: x**2)
marr["age2"] = np.power(marr["age"].values, 2)
marr["age2"] = [x**2 for x in marr["age"].values]
marr.head(3)
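
The sketch below illustrates the remark about the dot operator on a small hypothetical frame `df` (not the marr dataset). Assigning through the dot only sets a Python attribute on the object, while square brackets create a real column.

```python
import warnings
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({"age": [22.0, 27.0, 32.0]})

with warnings.catch_warnings():
    # pandas warns that a new attribute name does not create a column.
    warnings.simplefilter("ignore")
    df.age2 = df.age ** 2          # sets a plain attribute, not a column

dot_created_column = "age2" in df.columns

df["age2"] = df["age"] ** 2        # creates a real column
bracket_created_column = "age2" in df.columns
print(dot_created_column, bracket_created_column)
```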

You may also use a list comprehension even if the loop variable is not used to compute the returned value.

In [None]:
import random
rainbow = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
marr["favColor"] = "col"
# List comprehension is used only to create a list of length equal to the number of dataframe's rows.
marr["favColor"] = [random.choice(rainbow) for x in marr.index.values]
marr.head(5)

### Indexing
There are two basic ways of indexing and selecting data in pandas: integer-position-based (.iloc) and label-based (.loc). The first is analogous to any two-dimensional matrix in numpy. The second refers to the label (index) of a dataframe, which may have any form - it does not have to be sorted, monotonic, numerical etc.

In [None]:
marr.set_index(np.random.permutation(np.arange(marr.shape[0])), inplace=True)
display(marr.head(5))
print("Chosen part of df: \n", marr.iloc[3:5, 2:4])
print("\nChosen part of series: \n", marr.children.iloc[3:5])
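
To make the difference between positions and labels concrete, here is a small sketch on a hypothetical frame `df` with a deliberately shuffled index (the data are made up for illustration).

```python
import pandas as pd

# Hypothetical frame whose integer labels are not in order.
df = pd.DataFrame({"children": [3.0, 0.0, 1.0]}, index=[2, 0, 1])

first_by_position = df.iloc[0]["children"]  # first row as stored
row_labelled_zero = df.loc[0]["children"]   # row whose label is 0

# iloc slices exclude the end position; loc slices include the end label.
n_iloc = len(df.iloc[0:1])
n_loc = len(df.loc[2:0])
print(first_by_position, row_labelled_zero, n_iloc, n_loc)
```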

In [None]:
print(marr.shape)
print("Save indices of the red rows.")
redRows = marr.favColor=="red"
print(type(redRows), redRows.shape)
print(redRows.head(10))
print("Choose the red rows")
display(marr.loc[redRows].head(5))
display(marr.loc[marr.favColor=="red"].head(5))

print("Choose red or orange rows")
display(marr.loc[marr.favColor.isin(['red','orange'])].head(5))

print("Choose young red rows")
# You cannot use "and" instead of "&" in this case
display(marr.loc[(marr.favColor=="red") & (marr.age<=25)].head(5))
display(marr[(marr.favColor=="red") & (marr.age<=25)].head(5))
%timeit -n 10 marr.loc[(marr.favColor=="red") & (marr.age<=25)]
%timeit -n 10 marr[(marr.favColor=="red") & (marr.age<=25)]
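
The comment about "and" can be checked on a toy frame. This is only a sketch: `df` and its values are hypothetical. "&" works element by element, whereas "and" needs a single truth value for a whole series and therefore raises an error.

```python
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({"favColor": ["red", "blue", "red"], "age": [22.0, 24.0, 40.0]})

is_red = df["favColor"] == "red"
is_young = df["age"] <= 25

# "&" combines the two boolean series element by element.
young_red = df[is_red & is_young]

# "and" asks for one truth value of a whole series, which is ambiguous.
try:
    df[is_red and is_young]
    and_failed = False
except ValueError:
    and_failed = True
print(list(young_red.index), and_failed)
```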

### Indexing to modify
As you can see in the last example, .loc is not necessary if you only select rows to display. However, it is required when you modify rows.

In [None]:
marr.loc[marr.favColor=="red", "favColor"]="reddish"
marr.head(10)
# This code does not work:
# marr[marr.favColor=="red", "favColor"]="reddish"

In [None]:
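# The code below uses chained assignment. Depending on the pandas version it either works with a warning
# or does not modify marr at all (copy-on-write), so prefer the .loc form from the previous cell.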
marr.favColor.loc[marr.favColor=="reddish"]="red"
marr.head(10)

### Queries
In practice, repeating the dataframe's name may be inconvenient when you want to select a part of a dataframe. This is why the query interface was created. query is a method that evaluates a condition given as a string and selects the matching rows, much like .loc with a boolean mask, but with a clearer and more readable syntax.

In [None]:
display(marr.query('favColor == "violet"').head(5))
display(marr.query('favColor == "violet" & age > 30').head(5))
# in this case "and" may be used instead of "&"
display(marr.query('favColor == "violet" and age > 30').head(5))

In [None]:
# You may combine various criteria, including comparison of columns.
display(marr.query('favColor == "blue" and years < educ').head(5))

Additionally, query allows inserting dynamic values to our queries. Operator @ refers to variables in Python (in the environment), not dataframe columns.

In [None]:
ageLimit = 30
display(marr.query('age <= @ageLimit').head(5))

colors = ["violet", "blue"]
display(marr.query('favColor in @colors').head(5))

for color in colors:
    display(marr.query('favColor == @color').head(5))
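
As a rough check that query and a boolean mask select the same rows, the sketch below compares both on a hypothetical frame (`df`, `age_limit` and `wanted` are made-up names for illustration).

```python
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({
    "favColor": ["violet", "blue", "red", "violet"],
    "age": [22.0, 37.0, 27.0, 42.0],
})
age_limit = 30
wanted = ["violet", "blue"]

# The same selection written as a query string and as a boolean mask.
via_query = df.query("age <= @age_limit and favColor in @wanted")
via_mask = df.loc[(df["age"] <= age_limit) & df["favColor"].isin(wanted)]
same_rows = via_query.equals(via_mask)
print(same_rows, list(via_query.index))
```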

### Evaluation
Pandas allows evaluating expressions in a way similar to query. In some cases (but not always) using eval is faster than assigning directly.

In [None]:
%timeit -n 5 marr["age2"] = marr["age"]*marr["educ"]
%timeit -n 5 marr.eval('age2 = age*educ', inplace=True)
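
Timing aside, it may help to see what eval returns. In this sketch (hypothetical frame `df` and column name `age_educ`), eval without inplace=True returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({"age": [22.0, 27.0], "educ": [12.0, 16.0]})

# Without inplace=True, eval returns a new frame with the extra column.
with_product = df.eval("age_educ = age * educ")
direct = df["age"] * df["educ"]
print(list(with_product["age_educ"]), "age_educ" in df.columns)
```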


## Creating dataframes and series
Often you will need to create a new dataframe or series from other sets or lists. There are many ways to do it, some basic ones are shown below.

In [None]:
# Let's create numpy vectors with random content at the beginning.
noUsers = 1000
ids = np.arange(1, noUsers+1)
sex = np.random.randint(0,2,(noUsers))
age = np.floor(np.maximum(np.minimum(np.random.gamma(5, scale=1.0, size=(noUsers)), 13),1)*6+5)
# Prepare a dictionary combining contents with column names
data = {'idUser': ids, 'sex': sex, 'age': age}
# Create a dataframe
users = pd.DataFrame(data)
display(users.head())

In [None]:
# You can skip creating a dictionary
users = pd.DataFrame(np.vstack([ids, sex, age]).transpose(), columns=['idUser', 'sex', 'age'])
display(users.head())

Often you will need to build a dataset dynamically (from an API, web scraping, etc.). It is convenient to create a list of lists (as consecutive rows) and then create a dataframe from it.

In [None]:
rows = []
for k in range(10):
    row = [
        np.random.randint(0,k+1),
        np.random.randint(k,2*k+1),
        np.random.randint(2*k,3*k+1)
    ]
    rows.append(row)
display(pd.DataFrame(rows, columns=["var1", "var2", "var3"]))

There are even more functions to create dataframes in pandas. Apart from reading standard datasets (e.g. csv), it can read HTML and load frames from records or dictionaries.

In [None]:
# pd.DataFrame.from_
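
Two of these constructors are `pd.DataFrame.from_records` and `pd.DataFrame.from_dict`. The sketch below uses made-up data (the variable names and values are assumptions for illustration only).

```python
import pandas as pd

# A list of tuples, one tuple per row (hypothetical data).
records = [(1, "a"), (2, "b")]
from_records = pd.DataFrame.from_records(records, columns=["id", "label"])

# orient="index" treats dictionary keys as row labels.
by_row = pd.DataFrame.from_dict(
    {"r1": [1, 2], "r2": [3, 4]}, orient="index", columns=["x", "y"]
)
print(from_records.shape, list(by_row.index))
```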

## apply and map
When modifying the contents of a df you may sometimes want to use your own, more complicated functions. One possible solution is to write a function that takes a numpy vector as an argument and to pass it the values of a series (e.g. marr["age2"] = np.power(marr["age"].values, 2)). Sometimes you may want to perform operations on rows/columns/dataframes, and not only on a series. You can use the apply, map and applymap methods for this purpose. They are very similar to each other. A simplified description:
* apply - works on vectors: on a series, or on dataframe rows/columns.
* map - applies a function (or a dictionary lookup) to each element of a series
* applymap - as above, but on each element of a dataframe (renamed to DataFrame.map in pandas 2.1)

At this point you may not see why you would need these methods, because you have no experience with them yet. Have a look at apply and map now anyway, so that they are familiar when you need them.

In [None]:
# Perform operation on each element of a series
marr["binRel"] = marr.religious.map(lambda x: 0 if x<3 else 1)
marr.head()

In [None]:
# Perform operation on each element of a series, but in a slightly different way. 
marr["binRel"] = marr.religious.apply(lambda x: 0 if x<3 else 1)
marr.head()
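
map also accepts a dictionary instead of a function. The sketch below uses a hypothetical series and a made-up mapping; values missing from the dictionary become NaN.

```python
import pandas as pd

# Hypothetical values standing in for marr.religious.
religious = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A made-up lookup table; 5.0 is deliberately left out.
labels = {1.0: "low", 2.0: "low", 3.0: "high", 4.0: "high"}
mapped = religious.map(labels)
print(mapped)
```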

When using series (series.apply as opposed to df.apply) the difference between apply and map is very subtle. Here you can see an example of differences: (https://stackoverflow.com/a/27368948)

In [None]:
display(marr.religious.iloc[0:5].apply(lambda x: pd.Series([x, x])))
display(marr.religious.iloc[0:5].map(lambda x: pd.Series([x, x])))

applymap lets you apply any function to each element of a dataframe. In practice, because dataframes usually have columns of different types, applymap is not used often.

In [None]:
marr.iloc[0:3].applymap(lambda x: print(type(x), x))

You will probably use apply on a whole dataframe quite often, e.g. for checking the maximum value in each column.

In [None]:
marr.apply(np.max)

When you use apply, every column/row is passed as a series. This is why iterating over rows (axis=1) and addressing values by column name may be convenient.

In [None]:
marr.iloc[0:20].apply(lambda x: "long happy marriage" if (x['age'] > 35 and x['years']>20) else "no", axis=1)

Keep in mind that creating a new series for every row adds a large overhead.

In [None]:
%timeit -n 10 marr.apply(lambda x: "long happy marriage" if (x['age'] > 35 and x['years']>20) else "no", axis=1)
%timeit -n 10 np.apply_along_axis(lambda x: "long happy marriage" if (x[1] > 35 and x[2]>20) else "no", 1, marr.values)
%timeit -n 10 ["long happy marriage" if (x.age > 35 and x.years>20) else "no" for x in marr.itertuples()]
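
When the row-wise rule is a simple condition, a vectorized expression avoids the per-row overhead entirely. The following sketch uses `np.where` on a hypothetical frame; this is an alternative not timed in the cells above, so treat any speed-up as an assumption to verify on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({"age": [42.0, 27.0, 37.0], "years": [23.0, 6.0, 13.0]})

# Row-wise version, as in the cells above.
row_wise = df.apply(
    lambda x: "long happy marriage" if (x["age"] > 35 and x["years"] > 20) else "no",
    axis=1,
)

# Vectorized version: the condition is evaluated on whole columns at once.
vectorized = np.where((df["age"] > 35) & (df["years"] > 20), "long happy marriage", "no")
print(list(vectorized))
```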

### Grouping
Operations on a grouped dataset are used very often. The popularity of pivot tables in Excel is proof of that. Grouping is immensely useful for the statistical description of a dataset. Look at the following examples.

In pandas, the groupby method is used for this purpose. It creates groups of row indices according to the given key(s). It allows you to avoid creating unnecessary copies of a whole dataframe. It is a particularly big memory-saver when you have a large dataset with a lot of columns.

In [None]:
# You can save grouped rows as a separate variable,
colorGroups = marr.groupby(['favColor'])
# you can display or use one of the groups...
display(colorGroups.get_group("blue").head(5))
display(colorGroups.get_group("blue")['educ'].head(5))
# ...or perform a function on grouped values
display(colorGroups.count())
display(colorGroups.mean())

#### Aggregating
When you have groups, you usually want to use them for some purpose, such as descriptive statistics for every group. The agg() method gives you much greater control over the resulting table than calling a function directly on the grouped object. You may freely choose which columns and functions should be used.

In [None]:
print("Basic aggregating")
display(colorGroups.agg({'educ':'sum', 'years': 'mean'}))

print("Aggregating using numpy/lambda functions")
display(colorGroups.agg({'educ':np.mean, 'years': lambda x: np.sqrt(x).sum()}))

print("Aggregating with many statistical functions for a single column")
marr.groupby(['favColor']).agg({'educ':[np.mean, 'sum', np.std], 'years': 'mean'})

The last example shows a MultiIndex on columns. The columns form a hierarchy, so selecting a top-level column such as educ returns a whole dataframe (as shown below).

In [None]:
temp = marr.groupby(['favColor']).agg({'educ':[np.mean, 'sum', np.std], 'years': 'mean'})
display(temp)
print(type(temp["educ"]))
display(temp["educ"])

You may group by more than one variable. Let's define a binary variable which groups people by age: True if they are older than 35, False otherwise.

In [None]:
marr["older"] = (marr.age > 35)
# Of course order of arguments makes a difference
display(marr.groupby(['favColor', 'older']).agg({'educ':[np.mean, 'sum', np.std], 'years': 'mean'}))
display(marr.groupby(['older', 'favColor']).agg({'educ':[np.mean, 'sum', np.std], 'years': 'mean'}))
# Save the last result
aggs = marr.groupby(['older', 'favColor']).agg({'educ':[np.mean, 'sum', np.std], 'years': 'mean'})

A MultiIndex may be useful, but definitely not in every case. Fortunately we may easily flatten it into a single level by joining the level names.

In [None]:
aggs.columns =  [x+y.capitalize() for x,y in aggs.columns.values]
display(aggs)
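
An alternative to flattening afterwards is named aggregation, where each output column gets its name up front. This is a sketch on a hypothetical frame; the output names (`educMean` and so on) are chosen here only to mirror the flattened names above.

```python
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({
    "favColor": ["red", "red", "blue"],
    "educ": [12.0, 16.0, 14.0],
    "years": [2.0, 6.0, 10.0],
})

# Each keyword is an output column: name=(input column, function).
flat = df.groupby("favColor").agg(
    educMean=("educ", "mean"),
    educSum=("educ", "sum"),
    yearsMean=("years", "mean"),
)
print(flat)
```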

MultiIndex on rows also may or may not be useful. Often it is better to have values as columns and not as index.

In [None]:
display(aggs.reset_index())
# We may drop only one index level
# Note that row index is not unique in this case.
display(aggs.reset_index(level=0))

#### Transforming and apply
Grouping may be useful not only for aggregating, but also for performing operations on columns inside groups. You may need to merge the results with the initial dataframe. The .transform() method is used for this purpose, as it operates on a particular column and returns a result with the same index as the input. The cell below computes the mean age of people in a group and broadcasts the values into the original df shape.

In [None]:
marr["meanAgePerColor"] = marr.groupby(['favColor'])["age"].transform(np.mean)
display(marr.head(10))

You may want to perform operations on rows, but taking aggregates into account.

In [None]:
marr["ageDeMeaned"] = marr.groupby(['favColor'])["age"].transform(lambda x: x - np.mean(x))
display(marr.head(10))
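
On a small hypothetical frame the broadcasting is easy to verify by hand. In this sketch the string "mean" is passed to transform instead of np.mean, which is an equivalent spelling.

```python
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({
    "favColor": ["red", "blue", "red", "blue"],
    "age": [20.0, 30.0, 40.0, 50.0],
})

# The group mean is repeated for every row of the group (same index as df).
df["meanAgePerColor"] = df.groupby("favColor")["age"].transform("mean")
df["ageDeMeaned"] = df["age"] - df["meanAgePerColor"]
print(df)
```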

Apply function gives us even more possibilities, because it can perform operations on a whole dataframe inside of a group. It makes operations on multiple columns easy.

In [None]:
print(marr.groupby(['favColor']).apply(lambda x: x["age"]-x["educ"]).shape)
display(marr.groupby(['favColor']).apply(lambda x: x["age"]-x["educ"]))
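# Note: this direct assignment may fail, because the result is indexed by (favColor, original index)
# instead of the original index alone. The next cell shows how to align it.
marr["nonEducYears2"]=marr.groupby(['favColor']).apply(lambda x: x["age"]-x["educ"])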

In [None]:
temp = marr.groupby(['favColor']).apply(lambda x: x["age"]-x["educ"])
temp.index = [y for x,y in temp.index.values]
marr["nonEducYears"] = temp
marr["nonEducYears2"]=marr.groupby(['favColor']).apply(lambda x: x["age"]-x["educ"]).reset_index(level=0, drop=True)
marr.head(10)
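
Another way to keep the original index is to pass `group_keys=False` to groupby, so that the group label is not added as an index level. The sketch below uses a hypothetical frame and a made-up column name (`ageMinusGroupEduc`); depending on the pandas version, apply may also emit a deprecation warning about grouping columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for marr.
df = pd.DataFrame({
    "favColor": ["red", "blue", "red"],
    "age": [30.0, 40.0, 50.0],
    "educ": [10.0, 20.0, 14.0],
})

# group_keys=False keeps the original row labels in the result.
out = df.groupby("favColor", group_keys=False).apply(
    lambda x: x["age"] - x["educ"].mean()
)

# Assignment aligns on the index, so the values land in the right rows.
df["ageMinusGroupEduc"] = out
print(df)
```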

apply is very flexible and can run almost any function, including those which return objects of other dimensions. Below, describe is used to summarize group properties.

In [None]:
print(marr.groupby(['favColor']).apply(lambda x: x.describe()).shape)
marr.groupby(['favColor']).apply(lambda x: x.describe())

#### Group filtering
The filter method returns the rows of those groups which meet some criterion.
For example, if you keep only groups with mean age over 29, a subset of the original rows is returned.

In [None]:
print(marr.groupby(['favColor']).filter(lambda x: x["age"].mean() >29).shape)
temp = marr.groupby(['favColor']).filter(lambda x: x["age"].mean() >29)
print(temp.favColor.unique())
temp.head(10)
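
Because favColor in marr is random, the surviving groups differ from run to run. The sketch below repeats the same filter on a small hypothetical frame where the outcome can be checked by hand.

```python
import pandas as pd

# Hypothetical toy frame: mean age is 24.5 for red and 37.0 for blue.
df = pd.DataFrame({
    "favColor": ["red", "red", "blue", "blue"],
    "age": [22.0, 27.0, 32.0, 42.0],
})

# Keep all rows of the groups whose mean age exceeds 29.
kept = df.groupby("favColor").filter(lambda x: x["age"].mean() > 29)
print(kept)
```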