<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
## Open Machine Learning Course
<center>Authors: [Yury Kashnitsky](https://www.linkedin.com/in/festline/), Data Scientist @ Mail.Ru Group, Ekaterina Demidova, Data Scientist @ Segmento <br>
Translated and edited by [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/), [Christina Butsko](https://www.linkedin.com/in/christinabutsko/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), [Sergey Isaev](https://www.linkedin.com/in/isvforall/) and [Artem Trunov](https://www.linkedin.com/in/datamove/) <br>All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.


# <center> Topic 1. Exploratory Data Analysis with Pandas

<img align="center" src="https://habrastorage.org/files/10c/15f/f3d/10c15ff3dcb14abdbabdac53fed6d825.jpg"/>
<br>


### Outline
 1. About the course
 2. Assignments
 3. Demonstration of main Pandas methods
 4. First attempt at predicting telecom churn
 5. Assignment #1
 6. Useful resources


## 1. About the course

With this article, we, [OpenDataScience](https://www.linkedin.com/company/11241268/), launch an open Machine Learning course. We are not trying to develop another *comprehensive* introductory course on machine learning or data analysis, so this is not a substitute for fundamental education, online/offline courses and specializations, or books. The purpose of this series of articles is to quickly refresh your knowledge and help you find topics for further study. Our approach is similar to that of the authors of the [Deep Learning book](http://www.deeplearningbook.org/), which starts off with a review of mathematics and the basics of machine learning: short, concise, and with many references to other resources.

The course aims to balance theory and practice; each topic is therefore followed by an **assignment** due within a week. You can also take part in several Kaggle Inclass **competitions** held during the course.

### Syllabus
1. Exploratory data analysis with Pandas
1. Visual data analysis with Python
1. Classification, Decision Trees, and k Nearest Neighbors
1. Linear Classification and Regression
1. Bagging and Random Forest
1. Feature engineering and feature selection
1. Unsupervised Learning: Principal Component Analysis and Clustering
1. Vowpal Wabbit: Learning with gigabytes of data
1. Time series analysis with Python
1. Gradient Boosting

### Community

One of the biggest advantages of our course is its active community. If you join the OpenDataScience Slack team, you’ll find the authors of articles and assignments right there in the same channel (#eng_mlcourse_open), eager to help you. This helps a lot when you are taking your first steps in any discipline. Fill in [this form](https://docs.google.com/forms/d/1_pDNuVHwBxV5wuOcdaXoxBZneyAQcqfOl4V2qkqKbNQ/edit?usp=drive_web) to be invited. The form will ask you several questions about your background and skills, including a few easy math questions.

We chat informally and enjoy humor and emoji. Not every MOOC can boast such a lively community. There is also a [subreddit](https://www.reddit.com/r/ods_ai/) for students participating in the course.

### Prerequisites
The prerequisites are the following: basic concepts from calculus, linear algebra, probability theory and statistics, and Python programming skills. If you need to catch up, good resources are [Part I](http://www.deeplearningbook.org/contents/part_basics.html) of the "Deep Learning" book and various math and Python online courses (for Python, Codecademy will do). More info is available on the corresponding [Wiki page](https://github.com/Yorko/mlcourse_open/wiki/Prerequisites:-Python,-math-and-DevOps).

### What software you’ll need
For now, you’ll only need [Anaconda](https://www.continuum.io/downloads) (built with Python 3.6) to reproduce the code in the course. Later in the course, you’ll have to install other libraries like XGBoost and Vowpal Wabbit.

You can also resort to the [Docker container](https://hub.docker.com/r/festline/mlcourse_open/) with all necessary software already installed. More info is available on the corresponding [Wiki page](https://github.com/Yorko/mlcourse_open/wiki/Software-requirements-and-Docker-container).


## 2. Assignments

- Each article comes with an assignment in the form of a [Jupyter](http://jupyter.org) notebook. The task will be to fill in the missing code snippets and to answer questions in a Google Quiz form;
- Each assignment is due in a week with a hard deadline;
- Please discuss the course content (articles and assignments) in the #eng_mlcourse_open channel of the OpenDataScience Slack team or here in the comments to articles on Medium;
- The solutions to assignments will be sent to those who have submitted the corresponding Google form.

## 3. Demonstration of main Pandas methods 

There are dozens of good tutorials on Pandas and visual data analysis. If you are already familiar with these topics, just wait for the 3rd article in the series, where we get into machine learning.

**[Pandas](http://pandas.pydata.org)** is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like `.csv`, `.tsv`, or `.xlsx`. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with `Matplotlib` and `Seaborn`, `Pandas` provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in `Pandas` are implemented with **Series** and **DataFrame** classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of `Series` instances. `DataFrames` are great for representing real data: rows correspond to instances (objects, observations, etc.), and columns correspond to features for each of the instances.
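
As a minimal sketch of these two structures, here is a toy example. The index labels, column names, and values are made up for illustration and are not taken from the churn dataset.

```python
import pandas as pd

# A Series: one-dimensional, indexed, with a single dtype
minutes = pd.Series([265.1, 161.6, 243.4], index=['a', 'b', 'c'])
calls = pd.Series([110, 123, 114], index=['a', 'b', 'c'])

# A DataFrame built from a dictionary of Series: each key becomes a column
toy = pd.DataFrame({'minutes': minutes, 'calls': calls})

print(toy.dtypes)            # each column keeps its own dtype
print(type(toy['minutes']))  # selecting one column gives back a Series
```

Rows of `toy` play the role of instances and its two columns play the role of features.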


In [8]:
import numpy as np
import pandas as pd
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')


We’ll demonstrate the main methods in action by analyzing a [dataset](https://bigml.com/user/francisco/gallery/dataset/5163ad540c0b5e5b22000383) on the churn rate of telecom operator clients. Let’s read the data (using `read_csv`), and take a look at the first 5 lines using the `head` method:


In [21]:
df = pd.read_csv("../input/telecom_churn.csv")
df.head()
#print(df.head())

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


<details>
<summary>About printing DataFrames in Jupyter notebooks</summary>
<p>
In Jupyter notebooks, Pandas DataFrames are printed as these pretty tables seen above while `print(df.head())` looks worse.
By default, Pandas displays 20 columns and 60 rows, so, if your DataFrame is bigger, use the `set_option` function as shown in the example below:

```python
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
```
</p>
</details>

Recall that each row corresponds to one client, the **object** of our research, and columns are **features** of the object.

**Let’s have a look at data dimensionality, feature names, and feature types.**

In [22]:
print(df.shape)

(3333, 20)


From the output, we can see that the table contains 3333 rows and 20 columns.

Now let’s try printing out the column names using `columns`:

In [24]:
print(df.columns)

Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')


We can use the `info()` method to output some general information about the dataframe: 

In [25]:
# info() prints its report itself and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
State                     3333 non-null object
Account length            3333 non-null int64
Area code                 3333 non-null int64
International plan        3333 non-null object
Voice mail plan           3333 non-null object
Number vmail messages     3333 non-null int64
Total day minutes         3333 non-null float64
Total day calls           3333 non-null int64
Total day charge          3333 non-null float64
Total eve minutes         3333 non-null float64
Total eve calls           3333 non-null int64
Total eve charge          3333 non-null float64
Total night minutes       3333 non-null float64
Total night calls         3333 non-null int64
Total night charge        3333 non-null float64
Total intl minutes        3333 non-null float64
Total intl calls          3333 non-null int64
Total intl charge         3333 non-null float64
Customer service calls    3333 non-null int64
Churn                     3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(3)



`bool`, `int64`, `float64` and `object` are the data types of our features. We see that one feature is logical (`bool`), 3 features are of type `object`, and 16 features are numeric. With this same method, we can easily see if there are any missing values. Here, there are none because each column contains 3333 observations, the same number of rows we saw before with `shape`.
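
Missing values can also be counted explicitly. The sketch below uses a small made-up frame (its column names and values are illustrative only) and assumes you want one count per column:

```python
import numpy as np
import pandas as pd

# toy frame with one missing value in each column
toy = pd.DataFrame({'calls': [3, np.nan, 5], 'state': ['KS', 'OH', None]})

# isna() marks missing cells with True; summing counts them column by column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column of zeros from the same call on the churn data would confirm what `info()` reports.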

We can **change the column type** with the `astype` method. Let’s apply this method to the `Churn` feature to convert it into `int64`:




In [31]:
df['Churn'] = df['Churn'].astype('int64')

In [32]:
df.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0



The `describe` method shows basic statistical characteristics of each numerical feature (`int64` and `float64` types): the number of non-missing values, mean, standard deviation, minimum and maximum, median (the `50%` row), and the 0.25 and 0.75 quantiles.

In [46]:
df.describe()
# the 50% row (50th percentile) is the median

Unnamed: 0,Account length,Area code,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,437.182418,8.09901,179.775098,100.435644,30.562307,200.980348,100.114311,17.08354,200.872037,100.107711,9.039325,10.237294,4.479448,2.764581,1.562856,0.144914
std,39.822106,42.37129,13.688365,54.467389,20.069084,9.259435,50.713844,19.922625,4.310668,50.573847,19.568609,2.275873,2.79184,2.461214,0.753773,1.315491,0.352067
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0,0.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0,0.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0,0.0
max,243.0,510.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0,1.0


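By default, Pandas computes percentiles by linear interpolation between neighboring sorted values. A quick check on a tiny Series:

In [57]:
s = pd.Series([1, 2, 3, 8])

# the 75% quantile sits at position 0.75 * (len(s) - 1) = 2.25,
# i.e. a quarter of the way from the value 3 to the value 8
i = 3
j = 8
print(i + (j - i) * 0.25)
s.describe()

4.25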


count    4.000000
mean     3.500000
std      3.109126
min      1.000000
25%      1.750000
50%      2.500000
75%      4.250000
max      8.000000
dtype: float64

In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the `include` parameter.

In [None]:
df.describe(include=['object', 'bool'])

For categorical (type `object`) and boolean (type `bool`) features we can use the `value_counts` method. Let’s have a look at the distribution of `Churn`:

In [None]:
df['Churn'].value_counts()

2850 users out of 3333 are loyal; their `Churn` value is `0`. To calculate the proportion, pass `normalize=True` to the `value_counts` function.

In [None]:
df['Churn'].value_counts(normalize=True)


### Sorting

A DataFrame can be sorted by the value of one of the variables (i.e., columns). For example, we can sort by `Total day charge` (use `ascending=False` to sort in descending order):


In [None]:
df.sort_values(by='Total day charge', ascending=False).head()

Alternatively, we can also sort by multiple columns:

In [None]:
df.sort_values(by=['Churn', 'Total day charge'],
        ascending=[True, False]).head()


### Indexing and retrieving data

A DataFrame can be indexed in several ways.

To get a single column, you can use a `DataFrame['Name']` construction. Let's use this to answer a question about that column alone: **what is the proportion of churned users in our dataframe?**



In [None]:
df['Churn'].mean()


14.5% is actually quite bad for a company; such a churn rate can make the company go bankrupt.

**Boolean indexing** with one column is also very convenient. The syntax is `df[P(df['Name'])]`, where `P` is some logical condition that is checked for each element of the `Name` column. The result of such indexing is a DataFrame consisting only of the rows that satisfy the `P` condition on the `Name` column.

Let’s use it to answer the question:

**What are average values of numerical variables for churned users?**


In [None]:
# numeric_only=True skips text columns such as State, which have no mean
df[df['Churn'] == 1].mean(numeric_only=True)

**How much time (on average) do churned users spend on the phone during the daytime?**

In [None]:
df[df['Churn'] == 1]['Total day minutes'].mean()


**What is the maximum length of international calls among loyal users (`Churn == 0`) who do not have an international plan?**



In [None]:
df[(df['Churn'] == 0) & (df['International plan'] == 'No')]['Total intl minutes'].max()
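
A small sketch of combining conditions, on a made-up frame that reuses three column names from the dataset (the values are invented). Each condition needs its own parentheses because `&` binds more tightly than `==`; `|` and `~` work the same way for "or" and "not".

```python
import pandas as pd

toy = pd.DataFrame({
    'Churn': [0, 0, 1, 0],
    'International plan': ['No', 'Yes', 'No', 'No'],
    'Total intl minutes': [10.0, 13.7, 12.2, 6.6],
})

# keep rows where both conditions hold, then look at a single column
loyal_no_plan = toy[(toy['Churn'] == 0) & (toy['International plan'] == 'No')]
max_minutes = loyal_no_plan['Total intl minutes'].max()
print(max_minutes)
```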


DataFrames can be indexed by column name (label) or row name (index) or by the serial number of a row. The `loc` indexer is used for **indexing by name**, while `iloc` is used for **indexing by number**.

In the first case, we would say *"give us the values of the rows with index from 0 to 5 (inclusive) and columns labeled from State to Area code (inclusive)"*, and, in the second case, we would say *"give us the values of the first five rows in the first three columns (as in typical Python slice: the maximal value is not included)"*.


In [None]:
df.loc[0:5, 'State':'Area code']

In [None]:
df.iloc[0:5, 0:3]
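
The difference is easier to see when the row labels are not simply 0, 1, 2, and so on. The sketch below uses a made-up frame with the labels 10 to 40 (an assumption for illustration only):

```python
import pandas as pd

toy = pd.DataFrame(
    {'State': ['KS', 'OH', 'NJ', 'OK'], 'Area code': [415, 415, 415, 408]},
    index=[10, 20, 30, 40],
)

# loc slices by label: rows labeled 10, 20 and 30, the end label is included
by_label = toy.loc[10:30, 'State':'Area code']

# iloc slices by position: rows 0 and 1, column 0, the end position is excluded
by_position = toy.iloc[0:2, 0:1]

print(by_label.shape, by_position.shape)
```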

If we need the first or last row of the data frame, we can use the `df[:1]` or `df[-1:]` construct:



In [None]:
df[-1:]


### Applying Functions to Cells, Columns and Rows

**To apply functions to each column, use `apply()`:**


In [None]:
df.apply(np.max) 

The `apply` method can also be used to apply a function to each row. To do this, specify `axis=1`. Lambda functions are very convenient in such scenarios. `apply` works on a single column (a Series) as well: for example, if we need to select all states starting with W, we can do it like this:

In [None]:
df[df['State'].apply(lambda state: state[0] == 'W')].head()
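
For completeness, here is a minimal sketch of the row-wise form with `axis=1`, on a made-up two-row frame (the values are invented). With `axis=1` the function receives one row at a time as a Series:

```python
import pandas as pd

toy = pd.DataFrame({'Total day calls': [110, 123], 'Total eve calls': [99, 103]})

# axis=1: the lambda is called once per row
calls_per_row = toy.apply(
    lambda row: row['Total day calls'] + row['Total eve calls'], axis=1
)
print(calls_per_row.tolist())
```

For simple arithmetic like this, adding the columns directly is shorter and faster; row-wise `apply` is mostly useful when the logic cannot be written as column operations.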

The `map` method can be used to **replace values in a column** by passing a dictionary of the form `{old_value: new_value}` as its argument:

In [None]:
d = {'No' : False, 'Yes' : True}
df['International plan'] = df['International plan'].map(d)
df.head()

The same thing can be done with the `replace` method:

In [None]:
df = df.replace({'Voice mail plan': d})
df.head()
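
The two methods differ in how they treat values that are missing from the dictionary. A small sketch on a made-up Series (the value `'Unknown'` is hypothetical and does not occur in the dataset):

```python
import pandas as pd

d = {'No': False, 'Yes': True}
plans = pd.Series(['No', 'Yes', 'Unknown'])

mapped = plans.map(d)        # 'Unknown' is not a key of d, so it becomes NaN
replaced = plans.replace(d)  # 'Unknown' is not a key of d, so it is kept as is

print(mapped.tolist())
print(replaced.tolist())
```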


### Grouping

In general, grouping data in Pandas goes as follows:



```python
df.groupby(by=grouping_columns)[columns_to_show].function()
```


1. First, the `groupby` method divides the `grouping_columns` by their values. They become a new index in the resulting dataframe.
2. Then, columns of interest are selected (`columns_to_show`). If `columns_to_show` is not included, all columns other than the grouping ones will be included.
3. Finally, one or several functions are applied to the obtained groups per selected columns.

Here is an example where we group the data according to the values of the `Churn` variable and display statistics of three columns in each group:

In [None]:
columns_to_show = ['Total day minutes', 'Total eve minutes', 
                   'Total night minutes']

df.groupby(['Churn'])[columns_to_show].describe(percentiles=[])

Let’s do the same thing, but slightly differently by passing a list of functions to `agg()`:

In [None]:
columns_to_show = ['Total day minutes', 'Total eve minutes', 
                   'Total night minutes']

# string names select the same aggregations and avoid warnings in recent pandas
df.groupby(['Churn'])[columns_to_show].agg(['mean', 'std', 'min', 'max'])
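
The three grouping steps (split, select, aggregate) can be traced by hand on a tiny made-up frame. The values below are invented so that the result is easy to verify:

```python
import pandas as pd

toy = pd.DataFrame({
    'Churn': [0, 0, 1, 1],
    'Total day minutes': [100.0, 200.0, 250.0, 350.0],
})

# split by 'Churn', select one column, apply two aggregations
summary = toy.groupby('Churn')['Total day minutes'].agg(['mean', 'max'])
print(summary)
```

The values of `Churn` become the index of `summary`, and each aggregation becomes a column.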


### Summary tables

Suppose we want to see how the observations in our sample are distributed in the context of two variables - `Churn` and `International plan`. To do so, we can build a **contingency table** using the `crosstab` method:



In [None]:
pd.crosstab(df['Churn'], df['International plan'])

In [None]:
pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True)

We can see that most of the users are loyal and do not use additional services (International Plan/Voice mail).
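
Note that `normalize=True` divides every cell by the grand total. If shares within each row are more useful, `normalize='index'` can be used instead. A sketch on made-up data (four invented customers):

```python
import pandas as pd

toy = pd.DataFrame({
    'Churn': [0, 0, 0, 1],
    'Voice mail plan': ['Yes', 'No', 'No', 'No'],
})

# shares of the whole sample: all cells sum to 1
overall = pd.crosstab(toy['Churn'], toy['Voice mail plan'], normalize=True)

# shares within each Churn group: every row sums to 1
by_row = pd.crosstab(toy['Churn'], toy['Voice mail plan'], normalize='index')

print(overall)
print(by_row)
```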

This will resemble **pivot tables** to those familiar with Excel. And, of course, pivot tables are implemented in Pandas: the `pivot_table` method takes the following parameters:

* `values` - a list of variables to calculate statistics for,
* `index` - a list of variables to group data by,
* `aggfunc` - what statistics we need to calculate for groups, e.g., sum, mean, maximum, minimum or something else.

Let’s take a look at the average number of day, evening, and night calls by area code:

In [None]:
df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'],
               ['Area code'], aggfunc='mean')
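
For a single aggregation function, a pivot table like this gives the same numbers as a `groupby` followed by that function. A sketch on a made-up frame (three invented rows) that compares the two:

```python
import pandas as pd

toy = pd.DataFrame({
    'Area code': [408, 408, 415],
    'Total day calls': [100, 120, 90],
    'Total eve calls': [80, 100, 110],
})

pivot = toy.pivot_table(values=['Total day calls', 'Total eve calls'],
                        index=['Area code'], aggfunc='mean')
grouped = toy.groupby('Area code')[['Total day calls', 'Total eve calls']].mean()

print(pivot)
```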


### DataFrame transformations

Like many other things in Pandas, adding columns to a DataFrame is doable in many ways.

For example, if we want to calculate the total number of calls for each user, let’s create the `total_calls` Series and insert it into the DataFrame:



In [None]:
total_calls = df['Total day calls'] + df['Total eve calls'] + \
    df['Total night calls'] + df['Total intl calls']
df.insert(loc=len(df.columns), column='Total calls', value=total_calls) 
# loc parameter is the number of columns after which to insert the Series object
# we set it to len(df.columns) to paste it at the very end of the dataframe
df.head()

It is possible to add a column more easily without creating an intermediate Series instance:

In [None]:
df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + \
    df['Total night charge'] + df['Total intl charge']

df.head()

To delete columns or rows, use the `drop` method, passing the required indexes and the `axis` parameter (`1` if you delete columns, and nothing or `0` if you delete rows). The `inplace` argument tells whether to change the original DataFrame. With `inplace=False`, the `drop` method doesn't change the existing DataFrame and returns a new one with dropped rows or columns. With `inplace=True`, it alters the DataFrame.

In [None]:
# get rid of just created columns
df.drop(['Total charge', 'Total calls'], axis=1, inplace=True) 
# and here’s how you can delete rows
df.drop([1, 2]).head() 
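
The effect of `inplace` is easy to check on a throwaway frame. In the sketch below the frame and its column names `a` and `b` are made up:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

without_b = toy.drop('b', axis=1)     # returns a new frame, toy still has 'b'
without_rows = toy.drop([0, 1])       # rows are dropped by index label
toy.drop('a', axis=1, inplace=True)   # modifies toy itself and returns None

print(without_b.columns.tolist(), toy.columns.tolist())
```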


## 4. First attempt at predicting telecom churn


Let's see how churn rate is related to the *International plan* variable. We’ll do this using a `crosstab` contingency table and also through visual analysis with `Seaborn` (however, visual analysis will be covered more thoroughly in the next article).


In [None]:
pd.crosstab(df['Churn'], df['International plan'], margins=True)

In [None]:
# some imports and "magic" commands to set up plotting 
%matplotlib inline 
import matplotlib.pyplot as plt
# pip install seaborn 
import seaborn as sns
plt.rcParams['image.cmap'] = 'viridis'

In [None]:
sns.countplot(x='International plan', hue='Churn', data=df);


We see that, with *International Plan*, the churn rate is much higher, which is an interesting observation! Perhaps large and poorly controlled expenses on international calls cause disputes and lead to dissatisfaction among the telecom operator's customers.

Next, let’s look at another important feature – *Customer service calls*. Let’s also make a summary table and a picture.

In [None]:
pd.crosstab(df['Churn'], df['Customer service calls'], margins=True)

In [None]:
sns.countplot(x='Customer service calls', hue='Churn', data=df);


It may not be obvious from the summary table, but the picture clearly shows that the churn rate increases sharply starting from 4 calls to the service center.

Let’s now add a binary attribute to our DataFrame – `Customer service calls > 3`. And once again, let's see how it relates to churn. 


In [None]:
df['Many_service_calls'] = (df['Customer service calls'] > 3).astype('int')

pd.crosstab(df['Many_service_calls'], df['Churn'], margins=True)

In [None]:
sns.countplot(x='Many_service_calls', hue='Churn', data=df);


Let’s construct another contingency table that relates *Churn* to both *International plan* and the freshly created *Many_service_calls*.



In [None]:
pd.crosstab(df['Many_service_calls'] & df['International plan'], df['Churn'])

Therefore, predicting that a customer is not loyal (*Churn*=1) in the case when the number of calls to the service center is greater than 3 and the *International Plan* is added (and predicting *Churn*=0 otherwise), we might expect an accuracy of 85.8% (we are mistaken only 464 + 9 times). This number, 85.8%, that we got with very simple reasoning serves as a good starting point (*baseline*) for the further machine learning models that we will build.
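
The arithmetic behind the 85.8% figure can be reproduced from the counts quoted in this article: 3333 customers in total, 2850 of them loyal, and 464 + 9 mistakes made by the two-condition rule. The variable names below are our own.

```python
n_total = 3333          # rows in the dataset
n_loyal = 2850          # customers with Churn == 0
n_errors = 464 + 9      # mistakes of the two-condition rule

# always predicting "loyal" is right exactly for the loyal customers
naive_accuracy = n_loyal / n_total

# the rule is right everywhere except on its mistakes
rule_accuracy = (n_total - n_errors) / n_total

print(round(naive_accuracy, 3), round(rule_accuracy, 3))
```

The rule gets 2860 customers right, only 10 more than always answering "loyal".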

As we move on in this course, recall that, before the advent of machine learning, the data analysis process looked something like this. Let's recap what we've covered:
    
- The share of loyal clients in the sample is 85.5%. The most naive model that always predicts a "loyal customer" on such data will guess right in about 85.5% of all cases. That is, the proportion of correct answers (*accuracy*) of subsequent models should be no less than this number, and will hopefully be significantly higher;
- With the help of a simple rule that can be expressed by the following formula: "International plan = True & Customer Service calls > 3 => Churn = 1, else Churn = 0", we can expect an accuracy of 85.8%, which is just above 85.5%. Subsequently, we'll talk about decision trees and figure out how to find such rules **automatically** based only on the input data;
- We got these two baselines without applying machine learning, and they’ll serve as the starting point for our subsequent models. If, after enormous effort, we raise the share of correct answers by only 0.5%, then perhaps we are doing something wrong, and a simple model with two conditions would suffice;
- Before training complex models, it is recommended to manipulate the data a bit, make some plots, and check simple assumptions. Moreover, in business applications of machine learning, practitioners usually start with simple solutions and then experiment with more complex ones.


## 5. Assignment #1

In the first assignment, you'll analyze the UCI Adult data set containing demographic information about US residents. We suggest that you complete the tasks in the [Jupyter notebook](http://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/assignments_demo/assignment01_pandas_uci_adult.ipynb) (or [this](https://www.kaggle.com/kashnitsky/assignment-1-pandas-and-uci-adult-dataset) Kaggle Kernel), and then answer 10 questions in the [Google form](https://docs.google.com/forms/d/1ws9mchvdVGRyva_y_cPjASED8ATZTOsQFKfimohNaFE). You can edit your responses even after submitting the form.


## 6. Useful resources

* First of all, of course, the [official documentation of Pandas](http://pandas.pydata.org/pandas-docs/stable/index.html)
* Medium ["story"](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68) based on this notebook
* If you read Russian: an [article](https://habrahabr.ru/company/ods/blog/322626/) on Habrahabr with roughly the same material, and a [lecture](https://youtu.be/dEFxoyJhm3Y) on YouTube
* [10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
* [Pandas cheatsheet PDF](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
* GitHub repos: [Pandas exercises](https://github.com/guipsamora/pandas_exercises/) and ["Effective Pandas"](https://github.com/TomAugspurger/effective-pandas)
* [scipy-lectures.org](http://www.scipy-lectures.org/index.html) — tutorials on pandas, numpy, matplotlib and scikit-learn
