## Exploring data using Pandas

![Pandas](pandas.jpg)

So far we have explored Python and a few of its native libraries. Now we will make our lives simpler with tools for **data analysis**.

**Pandas** is the most popular library (so far) for importing and handling data in Python.

### Let's import some data from a CSV file

**When downloading my ipynb, remember to also get the `commits_pr.csv` file**

In [None]:
import pandas
cpr = pandas.read_csv("commits_pr.csv")

It is that easy to read a CSV file!
And there is more... Look at what my `cpr` is:

In [None]:
type(cpr)

Yes! A DataFrame. And it reads really nicely, look:

In [None]:
cpr.tail()
# We can use the head() and tail() methods to see only the first or last rows

Before moving forward, let me explain a little about this dataset.

This dataset represents a series of pull requests made to a subset of projects hosted on GitHub. We worked on this data to capture a specific type of contributor, which we called the *casual contributor*. These contributors are characterized by having a single pull request accepted in a project and not coming back (i.e., they have no long-term commitment to the project).

In this specific dataset, you will find the following columns:

* `user`: represents a user on GitHub (anonymized here)
* `project_name`: the name of the GitHub project in which the pull request was accepted
* `prog_lang`: programming language of the project
* `pull_req_num`: unique identifier of the pull request
* `num_commits`: number of commits sent within that specific pull request



### Some information about the dataframe

Dimensions/shape of the dataset (rows, columns)

In [None]:
cpr.shape

What about the column names?

In [None]:
cpr.columns

And the datatype per column?

In [None]:
cpr.dtypes

Some more information: the `info()` method prints a summary that includes the index dtype, the column dtypes, the non-null counts, and the memory usage.

In [None]:
cpr.info()

What is the type of a specific column?

In [None]:
type(cpr["num_commits"])

A *Series* is a one-dimensional, indexed array of values, similar to a list with labels. Each column of a DataFrame is a Series.
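
To make the idea of a Series concrete, here is a minimal sketch. The `toy` DataFrame is made up for illustration (it is not rows from `commits_pr.csv`); it only reuses two of the column names.

```python
import pandas

# Hypothetical toy data, not taken from commits_pr.csv
toy = pandas.DataFrame({
    "prog_lang": ["C", "java", "C"],
    "num_commits": [1, 3, 2],
})

commits = toy["num_commits"]       # selecting one column returns a Series
print(type(commits).__name__)      # the class name of the selected column
print(commits.index.tolist())      # the index labels, shared with the DataFrame
print(commits.tolist())            # the values as a plain Python list
```

The Series keeps the same index as the DataFrame it came from, which is what lets pandas line the values up again later.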

Before moving ahead, we can use the types to filter some columns. 

Let's say we want only the columns that store `int`:

In [None]:
int_columns = cpr.dtypes[cpr.dtypes == "int64"].index
int_columns

Now... I just want to see these columns... **BOOM**

In [None]:
cpr[int_columns].head()
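
pandas also has a built-in shortcut for this kind of filter, `select_dtypes()`. A small sketch, assuming a toy DataFrame invented for the example (the `ratio` column is hypothetical and does not exist in our dataset):

```python
import pandas

# Hypothetical toy data with one text, two integer and one float column
toy = pandas.DataFrame({
    "project_name": ["a", "b"],
    "pull_req_num": [10, 11],
    "num_commits": [1, 4],
    "ratio": [0.5, 1.5],
})

int_only = toy.select_dtypes(include="int64")    # only the int64 columns
numeric = toy.select_dtypes(include="number")    # integers and floats together

print(list(int_only.columns))
print(list(numeric.columns))
```

Both approaches should give the same integer columns; `select_dtypes()` just saves the intermediate step of building the list of names.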

### What about statistical information about my DataFrame?

The `describe()` method provides a summary of the numeric columns in your dataset: count, mean, standard deviation, minimum, maximum, and the 1st, 2nd (median) and 3rd quartiles. The count is the number of non-missing values in each column, so comparing it with the number of rows tells you whether values are missing.

In [None]:
cpr.describe()
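
To see how the count relates to missing values, here is a minimal sketch on made-up data (one value is deliberately missing; this is not our real dataset):

```python
import pandas

# Hypothetical toy column with one missing value
toy = pandas.DataFrame({"num_commits": [1, 2, None, 5]})

summary = toy.describe()
count = summary.loc["count", "num_commits"]    # non-missing values only
missing = toy["num_commits"].isna().sum()      # how many values are missing

print(len(toy), count, missing)
```

Here the DataFrame has 4 rows but the count is 3, so one value is missing.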

We can do it for a Series...

In [None]:
# Equivalent: cpr["num_commits"].describe()
cpr.num_commits.describe()

In [None]:
# Look at this with a non-numeric column
cpr.prog_lang.describe()  # cpr["prog_lang"].describe() works too

And we can get specific information per column

In [None]:
cpr.num_commits.median()

In [None]:
cpr.num_commits.mean()

In [None]:
cpr.num_commits.std()

### Playing with the data: sorting

We can sort our data easily using pandas.

In this example, we sort by number of commits, from the largest to the smallest

In [None]:
cpr.sort_values("num_commits", ascending=False).head(10)

We can sort using *many columns* by passing a list (rows are ordered by the first column, and ties are broken by the following ones)

In [None]:
cpr.sort_values(["prog_lang", "project_name", "num_commits"], ascending=False).head(10)
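
The `ascending` parameter can also take a list, one value per column, when you want different directions. A small sketch on made-up data (not rows from `commits_pr.csv`):

```python
import pandas

# Hypothetical toy data
toy = pandas.DataFrame({
    "prog_lang": ["java", "C", "C", "java"],
    "num_commits": [2, 1, 7, 9],
})

# Languages in A-Z order; inside each language, largest commit count first
ordered = toy.sort_values(["prog_lang", "num_commits"], ascending=[True, False])
print(ordered)
```

Note that the original index labels travel with the rows, so they appear out of order after sorting.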

In [None]:
cpr.head(10)

Notice that `cpr` itself did not change: `sort_values()` returns a new, sorted DataFrame. If you want to keep the sorted version in `cpr`, you can use the parameter `inplace`:

In [None]:
cpr.sort_values(["prog_lang", "project_name", "num_commits"], ascending=False, inplace=True)

In [None]:
cpr.head(10)
# cpr = pandas.read_csv("commits_pr.csv")  # --> to return to the original order

### Counting the occurrences of values

To count the occurrences of each value in a column, we select the column first and then use the method `value_counts()`

In [None]:
cpr.prog_lang.value_counts()

But... I just want to know which languages are out there. Is there a way?

*Always*

In [None]:
cpr["prog_lang"].unique()
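
Two related helpers may be handy here: `value_counts(normalize=True)` gives proportions instead of counts, and `nunique()` gives only the number of distinct values. A minimal sketch on a made-up Series:

```python
import pandas

# Hypothetical toy column of languages
langs = pandas.Series(["C", "java", "C", "php", "C"], name="prog_lang")

counts = langs.value_counts()                  # occurrences of each language
shares = langs.value_counts(normalize=True)    # same, as fractions of the total
n_langs = langs.nunique()                      # number of distinct languages

print(counts)
print(shares)
print(n_langs)
```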

## OK! Let's do something else... Like, selecting columns and filtering data

Let's say that I just want to look at the columns programming language, project name and number of commits. 

I can select them and create a new DF

In [None]:
selected_columns = ["prog_lang", "project_name", "num_commits"]
my_subset = cpr[selected_columns]
my_subset.head()

What if now I want to filter the pull requests with exactly 2 commits in projects written in `C`?

In [None]:
only_C = cpr[(cpr["prog_lang"]=='C') & (cpr["num_commits"]==2)]
only_C.describe()

We can filter whatever we want:

In [None]:
single_commit = cpr[cpr["num_commits"] == 1]

We can create filters in variables, and use whenever we want, as well

In [None]:
one_commit = cpr["num_commits"]==1
language_C = cpr["prog_lang"]=="C"
multi_commit = cpr["num_commits"]>1

In [None]:
cpr[one_commit & language_C].head(10)

And... we can use OR (|) and AND(&) to play!

In [None]:
cpr[one_commit | language_C].head(10)
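
Here is a minimal sketch of how the operators behave, on a made-up DataFrame so the result is easy to check by eye. The `~` (NOT) operator is an addition that the examples so far have not used.

```python
import pandas

# Hypothetical toy data
toy = pandas.DataFrame({
    "prog_lang": ["C", "C", "java", "java"],
    "num_commits": [1, 3, 1, 4],
})

one_commit = toy["num_commits"] == 1
language_C = toy["prog_lang"] == "C"

both = toy[one_commit & language_C]      # AND: one commit and written in C
either = toy[one_commit | language_C]    # OR: one commit, or written in C
not_c = toy[~language_C]                 # NOT: everything that is not C

print(both.index.tolist(), either.index.tolist(), not_c.index.tolist())
```

When writing the comparisons inline, keep each one inside parentheses, as in `(toy["num_commits"] == 1) & (toy["prog_lang"] == "C")`, because `&` and `|` bind more tightly than `==`.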

#### What if we want the pull requests with more than one commit for the projects written in "C" and those with 2 commits for the projects written in "typescript"???

Let's do it!


In [None]:
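# New filters for the typescript part of the question
two_commits = cpr["num_commits"]==2
language_typescript = cpr["prog_lang"]=="typescript"

# More than one commit in C, OR exactly two commits in typescript
cpr[(multi_commit & language_C) | (two_commits & language_typescript)]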


What if I wanted to convert the number of commits into a feature by creating bands of values that we define?
* 1 commit = group 1
* 2 - 5 commits = group 2
* 6 - 20 commits = group 3
* more than 20 commits = group 4

In [None]:
cpr.loc[cpr["num_commits"]==1, "group_commit"]=1
cpr.loc[(cpr["num_commits"]>1) & (cpr["num_commits"]<=5), "group_commit"]=2
cpr.loc[(cpr["num_commits"]>5) & (cpr["num_commits"]<=20), "group_commit"]=3
cpr.loc[cpr["num_commits"]>20, "group_commit"]=4
cpr.group_commit = cpr.group_commit.astype('int32')
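
The same banding can probably be written more compactly with `pandas.cut()`. This is a sketch of an alternative, not what the cell above does, and it runs on made-up values chosen to hit the edges of each band:

```python
import pandas

# Hypothetical toy data covering the edges of each band
toy = pandas.DataFrame({"num_commits": [1, 2, 5, 6, 20, 21]})

# Bins are right-inclusive by default: (0, 1], (1, 5], (5, 20], (20, inf]
toy["group_commit"] = pandas.cut(
    toy["num_commits"],
    bins=[0, 1, 5, 20, float("inf")],
    labels=[1, 2, 3, 4],
).astype("int32")

print(toy)
```

With `cut()` the band limits live in a single list, so changing them later means editing one line instead of four.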

In [None]:
cpr.head()

### I challenge you:

What if I wanted to know the average of `num_commits` for the pull requests in `group_commit` 4?

### I challenge you (2):

Can you do that average per language?


In [None]:
# numeric_only=True skips the text columns, which have no quantiles
cpr[cpr["prog_lang"] == "typescript"].quantile(0.75, numeric_only=True)














### Some more... 

Let's work with a new dataset... 

This one is not restricted to casual contributors: it covers all contributors

In [None]:
commits_complete = pandas.read_csv('commit_complete.csv')
commits_complete.sort_values('num_commits', ascending=False).head(10)

In [None]:
commits_complete['num_commits'].corr(commits_complete['additions'])
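
To get a feeling for what the number returned by `corr()` means, here is a minimal sketch on made-up data. The `deletions` column is hypothetical and only there to show a negative correlation.

```python
import pandas

# Hypothetical toy data: additions grow with commits, deletions shrink
toy = pandas.DataFrame({
    "num_commits": [1, 2, 3, 4],
    "additions": [10, 20, 30, 40],
    "deletions": [4, 3, 2, 1],
})

r_pos = toy["num_commits"].corr(toy["additions"])    # perfectly linear, increasing
r_neg = toy["num_commits"].corr(toy["deletions"])    # perfectly linear, decreasing

print(r_pos, r_neg)
```

By default `corr()` computes the Pearson coefficient, which ranges from -1 to 1. Values close to 0 indicate little linear relationship.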

In [None]:
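# numeric_only=True ignores non-numeric columns (pandas 2.0+ raises an error on text columns otherwise)
commits_complete.corr(numeric_only=True)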

In [None]:
commits_complete.corr(method='pearson', numeric_only=True).style.background_gradient(cmap='coolwarm')

### Can we play with graphics?

**Plot types:**
- 'line' : line plot (default)
- 'bar' : vertical bar plot
- 'barh' : horizontal bar plot
- 'hist' : histogram
- 'box' : boxplot
- 'kde' : Kernel Density Estimation plot
- 'density' : same as 'kde'
- 'area' : area plot
- 'pie' : pie plot
- 'scatter' : scatter plot
- 'hexbin' : hexbin plot

**Histogram**

In [None]:
cpr.num_commits.plot.hist(bins=200)

In [None]:
cpr[cpr["prog_lang"]=="C"].num_commits.plot.hist(bins=20, color="red", alpha=0.5)
cpr[cpr["prog_lang"]=="java"].num_commits.plot.hist(bins=20, alpha=0.5).legend(["C", "Java"])

In [None]:
cpr['prog_lang'].value_counts().plot.bar()

In [None]:
cpr[cpr["prog_lang"]== "C"].project_name.value_counts().plot.bar()

In [None]:
commits_complete.plot.scatter(x = "files_changed", y = "num_commits")

In [None]:
lang_c = cpr.prog_lang=="C"
lang_java = cpr.prog_lang=="java"
lang_php = cpr.prog_lang=="php"


cpr[(lang_c) | (lang_java) | (lang_php)].boxplot(by='prog_lang', column=['num_commits'])


In [None]:

plot = cpr[(lang_c) | (lang_java) | (lang_php)].boxplot(by='prog_lang', column=['num_commits'], showfliers=False, grid=False)

plot.set_xlabel("Language")
plot.set_ylabel("# of commits")
plot.set_title("")

**Just to show...**

that it is possible to do statistical analysis

In [None]:
from scipy import stats

stats.mannwhitneyu(cpr[(lang_c)].num_commits, cpr[(lang_java)].num_commits)


### Exporting

In [None]:
my_subset.to_dict()

In [None]:
cpr.to_csv('test.csv', sep=',')
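
A small sketch of the two exports on made-up data, writing to an in-memory buffer instead of a file so nothing is left on disk. The choices of `index=False` and `orient="records"` are my suggestions, not something the cells above use.

```python
import io

import pandas

# Hypothetical toy data
toy = pandas.DataFrame({
    "prog_lang": ["C", "java"],
    "num_commits": [1, 3],
})

# index=False leaves the row labels out of the CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV text back to check the round trip
buffer.seek(0)
restored = pandas.read_csv(buffer)

# orient="records" gives one dictionary per row
records = toy.to_dict(orient="records")

print(restored)
print(records)
```

Without `index=False`, the index is written as an extra unnamed column, which comes back as `Unnamed: 0` when the file is read again.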

## Go for the HW