# Introduction to Jupyter and Pandas
**Jupyter notebooks** are a widely used tool in data science projects. They are a great way to keep data explorations and code in the same place. A very powerful (and simple) feature is the ability to mix code and Markdown, so computation and explanations live side by side.

**pandas** is a Python library used for manipulation, exploration and processing of data.

## Quick tips

In a code cell, typing the name of an object followed by `.` and pressing `<TAB>` will show you its available attributes and methods. For example:

![tab completion](../data/misc/tab_completion.png)


You can also add a `?` at the end of any method and execute the cell to see the documentation of that method. For example:

![String documentation](../data/misc/string_doc.png)

## (very) quick pandas overview

In [None]:
# Import pandas
import pandas as pd

# Read data, usually as a CSV file
df = pd.read_csv('../data/housing/train.csv', index_col=['Id'])

The basic unit of work in pandas is the **DataFrame**. You can think of a DataFrame as a table in a database. In fact, you can merge, join and query DataFrames as you would with tables.
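
As a rough illustration of the table analogy, here is a minimal sketch of a SQL-style join followed by a filter. The two small tables, their column names (`ZoneCode`, `ZoneName`) and their values are made up for this example and are not part of the housing data.

```python
import pandas as pd

# Two tiny, hypothetical tables (not the housing dataset)
houses = pd.DataFrame({"Id": [1, 2, 3], "ZoneCode": ["A", "B", "A"]})
zones = pd.DataFrame({"ZoneCode": ["A", "B"],
                      "ZoneName": ["Residential", "Commercial"]})

# Left join on the shared key column, similar to SQL's LEFT JOIN
merged = houses.merge(zones, on="ZoneCode", how="left")

# Filter rows with a query string, similar to a SQL WHERE clause
residential = merged.query("ZoneName == 'Residential'")
print(residential)
```

`merge` keeps the row order of the left table when `how="left"`, which makes the result easy to compare with the original.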

Each row in a DataFrame is an observation or sample. When you select a single row or a single column, you get back an object called a pandas **Series**.
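
The following sketch shows what that looks like in practice. It uses a tiny made-up DataFrame (the values are illustrative only) so that it runs without the CSV file.

```python
import pandas as pd

# Small, hypothetical DataFrame indexed by Id
toy = pd.DataFrame(
    {"LotArea": [8450, 9600], "Street": ["Pave", "Grvl"]},
    index=pd.Index([1, 2], name="Id"),
)

column = toy["LotArea"]  # a Series indexed by the row labels
row = toy.loc[1]         # a Series indexed by the column names

is_column_series = isinstance(column, pd.Series)
is_row_series = isinstance(row, pd.Series)
print(type(column).__name__, type(row).__name__)
```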

You can have a look at the first few observations in a DataFrame with the `.head()` method:

In [None]:
df.head()

There are many other useful methods:

`.describe()` shows a quick statistical summary of your data. Note that, by default, it only does so for numerical columns.

In [None]:
df.describe()
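
If you also want a summary of non-numerical columns, one option is the `include` parameter of `.describe()`. The sketch below uses a small made-up DataFrame instead of the housing data.

```python
import pandas as pd

# Small, hypothetical DataFrame with one numeric and one text column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Default behaviour: only the numeric column is summarised
numeric_summary = toy.describe()

# include="all" adds count/unique/top/freq for the text column
full_summary = toy.describe(include="all")
print(full_summary)
```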

`.info()` shows general information about column types and non-null counts:

In [None]:
df.info()

`.sort_values()` allows you to sort your DataFrame in many ways, for example by the values of a particular column:

In [None]:
df.sort_values(by=['MSSubClass'], ascending=False).head()
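
Sorting by several columns works the same way. Here is a minimal sketch with made-up values, where `ascending` is given as a list with one entry per column:

```python
import pandas as pd

# Small, hypothetical DataFrame
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 20],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Sort by MSSubClass descending, then by LotArea ascending within each class
ordered = toy.sort_values(by=["MSSubClass", "LotArea"], ascending=[False, True])
print(ordered)
```

Note that `.sort_values()` returns a new DataFrame and leaves the original untouched.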

#### Selection
You can select data in many ways. The most useful are:

Select a particular subset of columns:

In [None]:
df[['MSSubClass', 'Utilities']].head()

Or select a row by its index label with `.loc`:

In [None]:
df.loc[1]

Or by integer position with `.iloc` (positions start at 0):

In [None]:
df.iloc[1]
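
The difference between the two is easy to miss when the index starts at 1, as the `Id` column does here. The sketch below uses a small made-up DataFrame to show that the same number selects different rows:

```python
import pandas as pd

# Small, hypothetical DataFrame whose index labels start at 1
toy = pd.DataFrame(
    {"LotArea": [8450, 9600, 11250]},
    index=pd.Index([1, 2, 3], name="Id"),
)

by_label = toy.loc[1]      # row whose label is 1 (the first row)
by_position = toy.iloc[1]  # row at position 1 (the second row, label 2)

print(by_label["LotArea"], by_position["LotArea"])
```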

You can also do boolean indexing and combine several conditions:

In [None]:
df[(df.LotFrontage > 70) & (df.PoolArea > 0)]
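
When combining conditions, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` in Python. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

# Small, hypothetical DataFrame
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, 85.0],
    "PoolArea": [0, 0, 512],
})

# AND: both conditions must hold
mask = (toy["LotFrontage"] > 70) & (toy["PoolArea"] > 0)
selected = toy[mask]

# OR: at least one condition must hold
either = toy[(toy["LotFrontage"] > 70) | (toy["PoolArea"] > 0)]

print(selected)
```

Storing the mask in a variable first can make longer filters easier to read and reuse.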

## More at https://pandas.pydata.org/pandas-docs/stable/