# Python for Data Science
# Mini Tutorial

## Jupyter

[Jupyter](https://jupyter.org) is a browser front-end connected to an IPython instance. It allows quick testing and lets you create documents that interleave code, output, images, and text. This is great for prototyping, demonstrations, and tutorials, but terrible for developing larger programs.

### Hello

In [None]:
print("Hello World!")

In [None]:
a = 5
b = 3.2

In [None]:
print("a: ", a, "Type: ", type(a))

In [None]:
print("b: ", b, "Type: ", type(b))

In [None]:
text = "Hello World!"
print("Text: ", text, type(text))

In [None]:
# Define a function that computes the Fibonacci series
def fibonacci(n):
    """
    Return a list containing the Fibonacci numbers smaller than n.
    """
    resultado = []
    a, b = 0, 1
    while a < n:
        resultado.append(a)
        a, b = b, a+b
    return resultado

In [None]:
# Use the function
fib100 = fibonacci(100)
print(fib100)

## LaTeX Code

## Pythagorean Theorem
**Date: 495 BC**

If a and b are the legs and c is the hypotenuse of a right triangle:

$c^2 = a^2 + b^2$


## HTML Code

<a href="https://www.w3schools.com">Visit W3Schools Julio</a>

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/220px-SNice.svg.png" alt="Smiley face" width="100" height="100" align="left">

## Numpy

[Numpy](http://www.numpy.org) is designed to handle large multidimensional arrays and to enable efficient computations on them. Under the hood, it runs pre-compiled C code, which is much faster than, say, a Python `for` loop.

In [None]:
import numpy as np
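
As a rough illustration of the difference, the sketch below computes the same sum of squares twice: once with an explicit Python loop and once with a single vectorized expression. The variable names are illustrative only, and no timing is claimed here; in a notebook you could measure both versions yourself with IPython's `%timeit` magic.

```python
import numpy as np

values = np.arange(1_000)

# Pure-Python loop: one interpreted iteration per element.
total_loop = 0
for v in values:
    total_loop += int(v) ** 2

# Vectorized: the squaring and the sum both run inside Numpy's compiled code.
total_vec = int(np.sum(values ** 2))

print(total_loop, total_vec)
```

Both versions give the same number; only the place where the loop runs (the Python interpreter versus compiled code) differs.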

### Indexing and slicing

Numpy arrays can be indexed and sliced like regular Python lists:

In [None]:
a_py = [1, 2, 3, 4, 5, 6, 7, 8, 9]
a_np = np.array(a_py)

In [None]:
print(a_py[3:7:2], a_np[3:7:2])
print(a_py[2:-1:2], a_np[2:-1:2])
print(a_py[::-1], a_np[::-1])

But you can also use arrays to index other arrays:

In [None]:
idx = np.array([7,2])
a_np[idx]

In [None]:
# a_py[idx]  # plain Python lists do not support this; uncommenting raises a TypeError

This allows convenient querying, reindexing, and even sorting:

In [None]:
ages = np.random.randint(low=30, high=60, size=10)
heights = np.random.randint(low=150, high=210, size=10)

print(ages)
print(heights)

In [None]:
print(ages < 50)

In [None]:
print(heights[ages < 50])
print(ages[ages < 50])

In [None]:
shuffled_idx = np.random.permutation(10)
print(shuffled_idx)
print(ages[shuffled_idx])
print(heights[shuffled_idx])

In [None]:
sorted_idx = np.argsort(ages)
print(sorted_idx)
print(ages[sorted_idx])
print(heights[sorted_idx])
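
Boolean masks can also be combined, and `np.argsort` can be reversed for descending order. The following is a minimal sketch that uses small fixed arrays (hypothetical values, chosen so the result is reproducible) instead of the random ones above.

```python
import numpy as np

# Hypothetical fixed data so the output is reproducible.
ages = np.array([34, 58, 41, 30, 52])
heights = np.array([172, 165, 190, 158, 181])

# Combine conditions element-wise with & (and), | (or), ~ (not).
# The parentheses are required because & binds tighter than < and >.
mask = (ages < 50) & (heights > 160)
print(heights[mask])

# argsort returns the indices that would sort the array in ascending order;
# reversing them gives descending order.
oldest_first = np.argsort(ages)[::-1]
print(ages[oldest_first])
print(heights[oldest_first])
```

Because the same index array is applied to both `ages` and `heights`, the two arrays stay aligned after sorting.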

### Broadcasting

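When Numpy is asked to perform an operation between arrays of different shapes, it "broadcasts" the smaller one across the bigger one, provided the shapes are compatible: comparing dimensions from the last one backwards, each pair must either be equal or contain a 1.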

In [None]:
a = np.array([4, 5, 6])
b = np.array([2, 2, 2])
a * b

In [None]:
a = np.array([4, 5, 6])
b = 2
a * b

The two snippets above produce the same result, but the second is easier to read and avoids allocating a second array.

In [None]:
a = np.arange(10).reshape(1,10)
b = np.arange(12).reshape(12,1)

In [None]:
print(a)
print(b)

In [None]:
print(a * b)
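
To see what happened to the shapes, the sketch below repeats the multiplication, inspects the result shape, and then tries a pair of shapes that cannot be broadcast. The `(3, 4)` and `(2, 4)` arrays are made up for the illustration.

```python
import numpy as np

a = np.arange(10).reshape(1, 10)   # shape (1, 10)
b = np.arange(12).reshape(12, 1)   # shape (12, 1)

# Each size-1 dimension is stretched to match the other array,
# so the result is a 12 x 10 multiplication table.
product = a * b
print(product.shape)

# np.broadcast reports the resulting shape without computing the product.
print(np.broadcast(a, b).shape)

# Shapes (3, 4) and (2, 4) are incompatible: 3 and 2 differ and neither is 1.
try:
    np.ones((3, 4)) + np.ones((2, 4))
    failed = False
except ValueError as err:
    failed = True
    print("Cannot broadcast:", err)
```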

## Matplotlib


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.rcParams['figure.figsize'] = [10, 7]

[Matplotlib](https://matplotlib.org) is the go-to library for producing plots with Python. It comes with two APIs: a MATLAB-like one that a lot of people have learned to use and love, and an object-oriented one that we recommend using.

In [None]:
x = np.linspace(-2*np.pi, 2*np.pi, 400)
y = np.tanh(x)
fig, ax = plt.subplots()
ax.plot(x, y)

You can draw multiple subplots in the same figure, or multiple functions in the same subplot:

In [None]:
x = np.linspace(0, 2*np.pi, 400)
y1 = np.tanh(x)
y2 = np.cos(x**2)
fig, axes = plt.subplots(1, 2, sharey=True)
axes[1].plot(x, y1)
axes[1].plot(x, -y1)
axes[0].plot(x, y2)

Matplotlib also comes with many options to customize the colors, the labels, the axes, etc.

For instance, see this [introduction to matplotlib](https://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb)
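
As a small taste of those options, here is a sketch using the object-oriented API. The particular colors, labels, and title are arbitrary choices for the illustration, not recommendations.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 400)

fig, ax = plt.subplots()

# Color, line style, and legend label are set per line.
ax.plot(x, np.sin(x), color="tab:blue", linestyle="-", label="sin(x)")
ax.plot(x, np.cos(x), color="tab:orange", linestyle="--", label="cos(x)")

# Axis labels, title, limits, grid, and legend are set on the Axes object.
ax.set_xlabel("x (radians)")
ax.set_ylabel("value")
ax.set_title("Sine and cosine")
ax.set_xlim(0, 2 * np.pi)
ax.grid(True)
ax.legend()
```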

## Pandas

[Pandas](http://pandas.pydata.org) (Python Data Analysis Library) is a library that provides a set of tools for data analysis.

In [None]:
import pandas as pd

Pandas dataframes can be created by importing a CSV file (or TSV, or JSON, or SQL, etc.)

In [None]:
df = pd.read_csv("../datasets/adult.csv")

In [None]:
df.head()

In [None]:
df.describe()
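
The same loading function handles other delimited formats. The sketch below reads tab-separated text from an in-memory string (made-up data standing in for a file on disk); for JSON and SQL sources, Pandas provides `pd.read_json` and `pd.read_sql`.

```python
import io
import pandas as pd

# Hypothetical tab-separated data standing in for a file on disk.
tsv_text = "name\tage\nAda\t36\nLinus\t28\n"

# read_csv accepts a path or any file-like object; sep="\t" handles TSV.
people = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(people)
print(people.dtypes)
```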

Pandas columns are Series, which are typically backed by Numpy arrays, so they support the same indexing magic:

In [None]:
df[df['age'] > 80]

Dataframes also provide most of the functionality you would expect as a database user (`df.sort_values`, `df.groupby`, `df.join`, `pd.concat`, etc.).
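
The sketch below shows three of these operations on a tiny made-up dataframe, with the rough SQL equivalent in the comments. The column names and values are hypothetical and unrelated to the dataset loaded above. Note that concatenation is the top-level function `pd.concat`, not a dataframe method.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
people = pd.DataFrame({
    "name": ["Ada", "Linus", "Grace", "Alan"],
    "team": ["red", "blue", "red", "blue"],
    "age": [36, 28, 45, 41],
})

# Roughly: ORDER BY age DESC
by_age = people.sort_values("age", ascending=False)
print(by_age)

# Roughly: SELECT team, AVG(age) FROM people GROUP BY team
mean_age = people.groupby("team")["age"].mean()
print(mean_age)

# Roughly: UNION ALL, stacking two frames that share the same columns.
extra = pd.DataFrame({"name": ["Guido"], "team": ["red"], "age": [50]})
combined = pd.concat([people, extra], ignore_index=True)
print(combined)
```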

### Plot Age Histogram

In [None]:
df['age'].hist()

### Plot Income by Gender

In [None]:
df['income_bin'] = df.income == " >50K"
plt.figure()
plt.title("By gender")
grouped = df.groupby("gender")
grouped.income_bin.mean().plot.barh()

In [None]:
fig, ax = plt.subplots(1,2)
ax[0].hist(df['age'], label="age")
ax[0].set_xlabel("age")
ax[1].hist(df['education-num'], label="education-num")
ax[1].set_xlabel("education-num")

## Other packages 

- [Plotly](https://plot.ly) and [Seaborn](http://seaborn.pydata.org): two other plotting libraries
- [Scipy](https://www.scipy.org): a science library built on top of Numpy
- [Scrapy](https://www.scrapy.org): a web crawling library