<img src="https://miro.medium.com/max/12032/0*gqpwhigkx0DvM5rh" align="right" width="300" length="200" />
<p style="text-align: center; font-size:30px">BIZTORY DATA SCIENCE PRACTICE</p>
<img src="https://www.biztory.com/hs-fs/hubfs/biztory-logo-large.png?width=500&name=biztory-logo-large.png" width="400" length="200" />

# BDSP - 🐍 Python Workshop
This is the handout document for the Python training organized by the BDSP on the 10th of August of 2020.

<img src="https://datarebellion.com/wp-content/uploads/2018/04/anaconda-logo-300x300.png" alt="Anaconda Logo" align="left" width="200" length="200"/>

## What is Anaconda?

[Anaconda](https://anaconda.org/) is a free and open-source distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. It is essentially a data science distribution with all batteries included, with features that appeal to both beginners and more advanced users.

<img src="https://gtrt7.com/blog/wp-content/uploads/2017/10/ic_jupyter.png" alt="Jupyter Logo" align="left" width="200" length="200"/>

## What is a Jupyter Notebook?
[Jupyter](https://jupyter.org/) is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. It is the perfect mix between code and storytelling. Jupyter notebooks are quite popular with beginners because they let you run code in independent chunks (cells), whereas many other IDEs (Integrated Development Environments) run the whole script at once by default. You might also know them by their former name, IPython notebooks.

<img src="https://lh3.googleusercontent.com/proxy/Cwu7ePPf1lOenogva0qFBULGlfhzHjthHCfXc4LXQPNvjLFDKAZMgCwG_IncoSRgxx08wIhbi5yeENlAV1HFrW8pFu72Fw3q4edtiQkMcg" alt="DataFrame" align="left" width="200" length="200"/>

## What is a DataFrame?
A [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) is a 2-dimensional labeled data structure with columns of potentially different types. It behaves much like a spreadsheet or a SQL table, and it is generally the most commonly used pandas object. DataFrames are not unique to Python (R has them too, for example), but whenever you hear DataFrames and Python together, the work is almost always done with the pandas package.
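As a rough illustration of the description above, the sketch below builds a small DataFrame by hand. The variable name `houses`, the column names and the values are all made up for this example and are not part of the workshop data.

```python
import pandas as pd

# Hypothetical example data: each key becomes a column,
# and each list holds the values of that column
houses = pd.DataFrame({
    "neighborhood": ["North", "South", "East"],   # text
    "bedrooms": [3, 2, 4],                         # whole numbers
    "price": [250000.0, 180000.5, 320000.0],       # decimal numbers
})

# Rows are labeled by the index (0, 1, 2), columns by their names
print(houses)
print(houses.shape)          # (number of rows, number of columns)
print(list(houses.columns))  # the column labels
```

Each column holds a single type of value, but different columns can hold different types, which is what makes a DataFrame feel like a spreadsheet.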

<img src="https://pythonawesome.com/content/images/2018/05/pandas-logo.png" alt="Pandas" align="left" width="200" length="200"/>

## What is Pandas?
[Pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. It is the de facto standard library for Python users who work with tabular data.

## Python Libraries
Python is a very versatile programming language. It can analyze data, create video games, build a website, and even give a robot instructions. The reason it is so versatile is the libraries that people have built. Libraries are essentially collections of functions that other people built before you, so you don't have to reinvent the wheel. Some functions can contain hundreds or even thousands of lines of code.

The diagram below displays the most common libraries in 2020. Check the documentation to see what each library does and which functions it allows you to use. 

<img src="https://academy.vertabelo.com/blog/top-python-libraries-2020/TopPythonLibraries980x400_v4_hu12044cd8842ae4843c303e7a089b6784_155193_980x400_fill_box_center_2.png" alt="Python libraries" width="1000" length="200"/>

## How can you install libraries?
Python has a lot of libraries included and Anaconda pre-installs lots of useful libraries for you to use. But if you want to be 100% sure, you might want to install the libraries that you need. Luckily, this is just one line of code for you! 

The only decision you have to make is whether you want to use [conda](https://docs.conda.io/en/latest/) (Anaconda's package manager) or [pip](https://pypi.org/project/pip/) (Python's default package manager). Regardless of what you choose, it is as simple as the code snippets below.

<em>Please note that the exclamation mark at the beginning of the line tells Jupyter that what follows is not a Python command, but a terminal command. Python packages are installed from the command line.</em> **Warning**: The exclamation mark is not a Python feature; it is a Jupyter (IPython) feature, closely related to the so-called magic commands. If you are curious about Jupyter magic commands, check the documentation [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-system).
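Before installing anything, it can be useful to check whether a library is already available. The sketch below shows one possible way to do that with Python's standard library. Note that `is_installed` is a helper name invented for this handout, not a built-in function.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if Python can find an importable package with this name."""
    # find_spec returns None when no top-level package with that name is found
    return find_spec(package_name) is not None


print(is_installed("json"))    # part of Python's standard library
print(is_installed("pandas"))  # depends on your environment
```

If the check prints `False` for a library you need, install it with one of the commands in the next cell.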

In [None]:
# Installing packages using PIP
!pip install pandas

# Installing packages using CONDA
# (-y answers the confirmation prompt, which you cannot answer from inside a notebook cell)
!conda install -y pandas

## How to load libraries?
Once the libraries have been installed on your computer, you need to import them to use them in your code. There are multiple options for importing a library, but these are the most common.

#### Import a whole library 
Example: ```import pandas```

However, this approach is not always the most practical. It is convenient if you will use many functions or you don't yet know which function you are looking for. The downside is that you have to type the full library name in front of every function you use, which makes your code longer and harder to read.

#### Adding an alias 
Example: ```import pandas as pd```

When you use a function coming from a library, you have to reference where that function comes from. For example, when you use the function *read_csv* from the package pandas, you have to type ```pandas.read_csv()```. To avoid typing the word pandas every time, you can give the library a shorter alias like pd. If you import the package with an alias, the same function call becomes ```pd.read_csv()```.

#### Importing only what you need
Example: ```from math import pi```

This is the preferred approach when you only need a few names from a library: only put on your plate what you are going to eat. Interestingly, when you import a name this way, you no longer need to add the name of the package in front of it. In other words, after ```from math import pi```, you don't have to type ```math.pi```; you can just type ```pi``` and it will work.
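To compare the three options side by side, here is a minimal sketch that uses the built-in `math` library for all of them. The alias `m` is an arbitrary choice for this example.

```python
# Option 1: import the whole library and use its full name
import math
print(math.sqrt(16))

# Option 2: import the library under a shorter alias
import math as m
print(m.sqrt(16))

# Option 3: import only the names you need, then use them without a prefix
from math import pi, sqrt
print(sqrt(16), pi)
```

All three options call exactly the same function; only the way you refer to it changes.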

In [1]:
# For today, we will use many functions of the pandas package
# So we will load all of them and give them the pd alias.

import pandas as pd

## How to load data?

In [2]:
# Open the train data
train = pd.read_csv('data/train.csv')

# Open the test data
test = pd.read_csv('data/test.csv')
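If you want to try `read_csv` without the workshop files, the sketch below reads a tiny CSV that lives in a string instead of on disk. The contents of `csv_text` are invented for illustration and are not taken from `data/train.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO lets pandas read it as if it were a file
csv_text = """Id,LotArea,SalePrice
1,8000,200000
2,9500,150000
3,12000,250000
"""

# read_csv accepts a file path or a file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# index_col uses a column as the row labels instead of the default 0, 1, 2, ...
sample_by_id = pd.read_csv(io.StringIO(csv_text), index_col="Id")

print(sample.head())
print(sample_by_id.head())
```

The first line of the CSV is used as the column names, and every following line becomes one row.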

## What are Python's Data Types?

In [19]:
# Check the data types in the train dataset
train.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object
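The output above lists one data type per column. As a small sketch of how pandas picks these types, the example below builds a three-column DataFrame and converts one column. The values are invented; only the column names mirror the output above. Depending on your pandas version, text columns may be reported as `object` or as a dedicated string type.

```python
import pandas as pd

# Hypothetical mini-dataset for illustration only
mini = pd.DataFrame({
    "LotArea": [8000, 9500, 12000],      # whole numbers
    "LotFrontage": [65.0, 80.0, 68.5],   # decimal numbers
    "MSZoning": ["RL", "RL", "RM"],      # text
})

# One dtype per column, inferred from the values
print(mini.dtypes)

# astype returns a converted copy; the original column is left unchanged
lot_area_as_float = mini["LotArea"].astype("float64")
print(lot_area_as_float.dtype)
```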

## How to explore a dataframe?

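In [None]:
# A notebook cell only displays its last expression, so print() is used here
print(train.shape)       # (number of rows, number of columns)
print(train.columns)     # column names
train.info()             # column types and non-null counts (prints by itself)
print(train.describe())  # summary statistics of the numeric columns
print(train.head())      # first five rows
print(train.tail())      # last five rows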

## How to filter a dataframe?

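In [None]:
# Run one line per cell: a notebook cell only displays its last expression
# Slice by position: every row except the first
train[1:]
# Select by integer position: first row, first column
train.iloc[[0], [0]]
# Select by label: row labeled 0, column 'SalePrice'
train.loc[[0], ['SalePrice']]
# Keep only the rows that meet a condition (200000 is just an example threshold)
train[train['SalePrice'] > 200000]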
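To see what happens behind a filter, the sketch below splits it into two steps on a small, invented DataFrame. The name `homes`, its values and the thresholds are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data for illustration only
homes = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8000, 9500, 12000, 7000],
    "SalePrice": [200000, 150000, 250000, 120000],
})

# Step 1: the condition produces one True/False value per row (a boolean mask)
is_expensive = homes["SalePrice"] > 180000
print(is_expensive)

# Step 2: indexing with the mask keeps only the rows marked True
expensive = homes[is_expensive]
print(expensive)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
big_and_expensive = homes[(homes["SalePrice"] > 180000) & (homes["LotArea"] > 10000)]
print(big_and_expensive)

# .loc selects by label, .iloc selects by integer position
print(homes.loc[0, "SalePrice"])  # row labeled 0, column 'SalePrice'
print(homes.iloc[0, 2])           # first row, third column (the same cell)
```

Note that `.loc` and `.iloc` use square brackets, not parentheses, because they index into the DataFrame rather than call a function.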