# Data science. 
----

<img src="images/vennmaster.png" width="400" height="200">
<img src="images/datasiciencevenn.jpeg" width="600" height="400">


Latin America:

* Junior role: 1000 USD
* Senior role: 4000 USD

## Stages of data science.

1. Ask the right question.
2. Gather data.
3. Clean data.
4. Explore.
5. Model and assess.
6. Deploy and share knowledge.


<center>
<img src="images/numpy.png" width="200" height="100">
<img src="images/pandas.png" width="200" height="100">
<img src="images/matplotlib.svg" width="200" height="100">
<img src="images/sklearn.png" width="200" height="100">
</center>

## NumPy
----

<img src="images/numpy.png" width="200" height="100">

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, along with functions for matrix operations, Fourier transforms, random number generation, and more.

Install: `pip install numpy` (https://numpy.org)

* The array is the central data structure of the NumPy library.
* NumPy arrays have a fixed size at creation, unlike Python lists, which can grow dynamically.
* NumPy arrays are faster and more compact than Python lists: an array consumes less memory and is convenient to use.
* While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be homogeneous, that is, share one data type.
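
The points above can be checked with a short sketch. The sizes and variable names here are illustrative assumptions, and the byte counts depend on the platform, so no exact figures are claimed.

```python
import sys
import numpy as np

values = list(range(1000))   # plain Python list of ints
arr = np.array(values)       # NumPy array with one shared dtype

# Mixing ints and floats upcasts every element to a single dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)           # float64

# Rough memory comparison: the array buffer vs. the list plus its int objects.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(arr.nbytes, list_bytes)

# "Fixed size": np.append builds a new array and leaves the original untouched.
grown = np.append(arr, 1000)
print(arr.size, grown.size)  # 1000 1001
```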

In [13]:
import numpy as np

### Creating arrays.

In [67]:
a = np.array([1, 2, 3])
print(a)
print(a.shape)

[1 2 3]
(3,)


In [72]:
# Creating a 2 dimensional array
a = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10],
              [11, 12, 13, 14, 15]])

print('Data type of a is: ', type(a))
print('Shape of a is: ', a.shape)
# Select the first row
print(a[0])

Data type of a is:  <class 'numpy.ndarray'>
Shape of a is:  (3, 5)
[1 2 3 4 5]
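
Indexing a 2-D array goes beyond picking one row. The sketch below reuses the same array and shows a few common patterns; the variable names are only for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10],
              [11, 12, 13, 14, 15]])

first_row = a[0]        # same as a[0, :]
second_col = a[:, 1]    # every row, column 1
block = a[1:, 2:4]      # rows 1-2, columns 2-3
last_item = a[-1, -1]   # negative indices count from the end

print(second_col)
print(block)
```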


In [41]:
# Create an array with values that are spaced linearly
b = np.linspace(0, 10, num = 21)
print(b)

[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5  5.   5.5  6.   6.5
  7.   7.5  8.   8.5  9.   9.5 10. ]
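
For comparison, a small sketch of `np.linspace` next to `np.arange`: the first fixes the number of points, the second fixes the step. The `retstep=True` argument is used here only to expose the spacing that `linspace` computed.

```python
import numpy as np

# linspace: you choose how many points; the endpoint is included by default.
points, step = np.linspace(0, 10, num=21, retstep=True)

# arange: you choose the step; the stop value is excluded.
ticks = np.arange(0, 10, 0.5)

print(len(points), step)       # 21 0.5
print(len(ticks), ticks[-1])   # 20 9.5
```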


In [48]:
# Concatenate 2 arrays
a = np.array([[1, 2], 
              [3, 4]])
b = np.array([[5, 6]])
c = np.concatenate((a, b), axis=0)
print(c)

[[1 2]
 [3 4]
 [5 6]]
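
Concatenation only works when the arrays agree on every axis except the one being joined. A minimal sketch with the same `a` and `b`, joining along each axis:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6]])                    # shape (1, 2)

rows = np.concatenate((a, b), axis=0)     # stacked below a: shape (3, 2)
cols = np.concatenate((a, b.T), axis=1)   # b.T has shape (2, 1): result is (2, 3)
print(cols)

try:
    np.concatenate((a, b), axis=1)        # row counts differ (2 vs 1)
except ValueError as err:
    print("Cannot concatenate:", err)
```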


In [73]:
# Ndarray (n-dimensional array) 
a = np.array([[[0, 1, 2, 3],
               [4, 5, 6, 7]],

              [[0, 1, 2, 3],
               [4, 5, 6, 7]],

              [[0, 1, 2, 3],
               [4, 5, 6, 7]]])
print('How many dimensions: ', a.ndim)
print(a.shape)
print(a.size)

How many dimensions:  3
(3, 2, 4)
24


### Reshape

In [62]:
# Reshape an array
a = np.arange(6)
print(a)
print('Shape a: ', a.shape)

b = a.reshape(2,3)
print(b)
print('Shape b: ', b.shape)

a2 = a[np.newaxis, :]
print('New shape of a: ',a2.shape)

[0 1 2 3 4 5]
Shape a:  (6,)
[[0 1 2]
 [3 4 5]]
Shape b:  (2, 3)
New shape of a:  (1, 6)
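
Two details about `reshape` are easy to miss: for a contiguous array like the one above it returns a view that shares memory with the original, and the new shape must hold exactly the same number of elements. A short sketch:

```python
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)

# For a contiguous array like this one, reshape returns a view, not a copy.
print(np.shares_memory(a, b))   # True
b[0, 0] = 99
print(a[0])                     # 99

# The new shape must hold exactly the same number of elements.
try:
    a.reshape(4, 2)
except ValueError as err:
    print("Reshape failed:", err)
```

If an independent array is needed, calling `.copy()` on the reshaped result is one option.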


### Basic Operators

In [101]:
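a = np.array([1, 2])
b = np.array([3, 4])
c = np.array([[1, 2],
              [3, 4]])
# Arithmetic operators work element by element
print(a+b)
print(a-b)
print(a*b)
print(a/b)
# Reductions: over the whole array, or along one axis
print(a.sum())
print(c.sum(axis=0))  # column sums
print(c.min())
print(c.max())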

[4 6]
[-2 -2]
[3 8]
[0.33333333 0.5       ]
3
[4 6]
1
4


### Broadcasting
Broadcasting is the mechanism that lets NumPy perform element-wise operations on arrays of different but compatible shapes, such as an array and a scalar.

In [130]:
a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])
print(a * 8)

[[ 8 16 24 32]
 [40 48 56 64]
 [72 80 88 96]]


In [96]:
a = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(a.shape)
print(a+1)

(3, 2)
[[2 3]
 [4 5]
 [6 7]]
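
Broadcasting is not limited to scalars. As a sketch of the general rule: shapes are compared from the right, and a dimension of size 1 (or a missing dimension) is stretched to match. The arrays below are illustrative.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)

# (3, 1) and (4,) broadcast to (3, 4).
table = col + row
print(table)

try:
    np.ones((3, 2)) + np.ones(3)    # trailing dimensions 2 and 3 do not match
except ValueError as err:
    print("Not broadcastable:", err)
```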


In [105]:
# 3x2 array of random floats in [0, 1)
a = np.random.random((3,2))
print(a)
# flatten() returns a 1-D copy
print(a.flatten())

[[0.02082434 0.33634902]
 [0.01675268 0.02722666]
 [0.77916537 0.02752493]]
[0.02082434 0.33634902 0.01675268 0.02722666 0.77916537 0.02752493]


In [112]:
# -1 tells reshape to infer that dimension from the array size (12 / 4 = 3)
np.arange(12).reshape(4, -1).shape

(4, 3)
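
A related distinction, sketched here as a side note: `flatten()` always copies, while `ravel()` returns a view when the memory layout allows it, so writes through the result of `ravel()` can change the original array.

```python
import numpy as np

a = np.arange(6).reshape(3, 2)

flat_copy = a.flatten()   # always a copy
flat_view = a.ravel()     # a view when the memory layout allows it

flat_copy[0] = -1         # does not touch a
flat_view[1] = -2         # writes through to a

print(a[0])               # first row now holds 0 and -2

inferred = np.arange(12).reshape(-1, 6)   # -1 is inferred as 2
print(inferred.shape)
```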

### Vectorization.

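Vectorization replaces explicit Python loops with whole-array operations that run in NumPy's compiled code. The cell below times both approaches on the dot product:

$$ a \cdot b=\sum_{i=1}^{n} a_{i} b_{i} $$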


In [169]:
import time
a = np.random.random(10000000)
b = np.random.random(10000000)
# Vectorized version (elapsed time is printed in milliseconds).
tic = time.time()
c = np.dot(a,b)
toc = time.time()
print('Vectorized: ',(toc-tic)*1000, 'Result', round(c, 2))

# Loop (non-vectorized) version.
tic = time.time()
c = 0
for i in range(10000000):
    c += a[i]*b[i]
toc = time.time()
print('Normal : ', (toc-tic)*1000, 'Result', round(c, 2))



Vectorized:  12.17794418334961 Result 2499347.75
Normal :  1885.524034500122 Result 2499347.75
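
To connect the formula with the code, here is a small sketch showing that the explicit sum over `i` and several vectorized spellings agree. The array length and the seeded generator are assumptions chosen to keep the check quick and repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator so runs are repeatable
a = rng.random(1_000)
b = rng.random(1_000)

# Literal translation of the sum over i.
loop_result = 0.0
for a_i, b_i in zip(a, b):
    loop_result += a_i * b_i

dot_result = np.dot(a, b)
matmul_result = a @ b            # operator form of the same product
sum_result = np.sum(a * b)       # element-wise multiply, then reduce

print(np.isclose(loop_result, dot_result))
```

The results are compared with `np.isclose` rather than `==` because floating-point sums can differ in the last digits depending on the order of accumulation.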


## Pandas
----

<img src="images/pandas.png" width="200" height="100">

Pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Install: `pip install pandas` (https://pandas.pydata.org)

* The two primary data structures of pandas are Series (1-dimensional) and DataFrame (2-dimensional).
* Pandas is built on top of NumPy.
* Missing data (represented as NaN) is easy to handle.
* DataFrame is a container for Series, and Series is a container for scalars.
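
The points above can be illustrated with a tiny, made-up table. The names and ages are hypothetical values, not taken from any dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
ages = pd.Series([22.0, np.nan, 26.0], name="Age")
people = pd.DataFrame({"Name": ["Ana", "Luis", "Marta"], "Age": ages})

column = people["Age"]                   # selecting one column gives back a Series
print(type(column).__name__)             # Series
print(column.isna().sum())               # 1
print(column.mean())                     # 24.0 (NaN is skipped by default)

filled = column.fillna(column.mean())    # replace NaN with the column mean
print(filled.tolist())                   # [22.0, 24.0, 26.0]
print(type(column.to_numpy()).__name__)  # ndarray
```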

<center>
<img src="images/pandas_01.svg" width="600" height="300">
</center>

In [142]:
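# Workaround for SSL certificate errors when downloading the dataset.
# It turns off HTTPS certificate verification, so use it only for demos like this one.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context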

In [143]:
import pandas as pd

In [163]:
df = pd.read_csv('https://raw.githubusercontent.com/unalyticsteam/datasets/master/titanic.csv')
df.head()

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |
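
`pd.read_csv` accepts file-like objects as well as URLs and paths. The sketch below parses two of the rows shown above from an in-memory string, so it runs without network access; the `nrows` and `usecols` arguments are shown only as examples of limiting what gets loaded.

```python
import io
import pandas as pd

# Two rows copied from the output above, in the same column layout.
csv_text = """Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)    # (2, 8)
print(sample.dtypes)

# Load only the first row and two of the columns.
first_only = pd.read_csv(io.StringIO(csv_text), nrows=1, usecols=["Name", "Age"])
print(first_only)
```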


In [164]:
df.describe()

| | Survived | Pclass | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|
| count | 887.0 | 887.0 | 887.0 | 887.0 | 887.0 | 887.0 |
| mean | 0.385569 | 2.305524 | 29.471443 | 0.525366 | 0.383315 | 32.30542 |
| std | 0.487004 | 0.836662 | 14.121908 | 1.104669 | 0.807466 | 49.78204 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.25 | 0.0 | 0.0 | 7.925 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.1375 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [165]:
# technical summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Survived                 887 non-null    int64  
 1   Pclass                   887 non-null    int64  
 2   Name                     887 non-null    object 
 3   Sex                      887 non-null    object 
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64  
 6   Parents/Children Aboard  887 non-null    int64  
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB


In [166]:
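# Store these columns as generic Python objects so they are treated as labels, not numbers.
# Sex already has dtype object (see df.info() above), so the last line changes nothing.
df['Survived'] = df['Survived'].astype('object')
df['Pclass'] = df['Pclass'].astype('object')
df['Sex'] = df['Sex'].astype('object')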
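
As an alternative to `object`, pandas also has a dedicated `category` dtype for label-like columns. This is a sketch on a hypothetical mini-frame that only borrows the Titanic column names; it is not what the cell above does.

```python
import pandas as pd

# Hypothetical mini-frame that reuses the Titanic column names.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
})

toy["Pclass"] = toy["Pclass"].astype("category")
toy["Sex"] = toy["Sex"].astype("category")

print(toy.dtypes)
print(list(toy["Pclass"].cat.categories))   # [1, 2, 3]

counts = toy["Sex"].value_counts()          # rows per label
print(counts)
```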

In [167]:
# Boolean mask: keep only the passengers older than 20
df[df['Age']>20]

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 880 | 0 | 3 | Mr. Henry Jr Sutehall | male | 25.0 | 0 | 0 | 7.0500 |
| 881 | 0 | 3 | Mrs. William (Margaret Norton) Rice | female | 39.0 | 0 | 5 | 29.1250 |
| 882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 |
| 885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.0000 |


In [168]:
# Keep only the passengers travelling in 2nd or 3rd class
df[df['Pclass'].isin([2, 3])]

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
| 5 | 0 | 3 | Mr. James Moran | male | 27.0 | 0 | 0 | 8.4583 |
| 7 | 0 | 3 | Master. Gosta Leonard Palsson | male | 2.0 | 3 | 1 | 21.0750 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 880 | 0 | 3 | Mr. Henry Jr Sutehall | male | 25.0 | 0 | 0 | 7.0500 |
| 881 | 0 | 3 | Mrs. William (Margaret Norton) Rice | female | 39.0 | 0 | 5 | 29.1250 |
| 882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 |
| 884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.4500 |
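
The two filters above can be combined. The sketch below uses a hypothetical four-row frame with Titanic-style column names; each comparison produces a boolean Series, and the masks are joined with `&`, `|`, and `~`.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 7.0],
})

# Wrap each comparison in parentheses before combining with & or |.
adults_low_class = toy[(toy["Age"] > 20) & (toy["Pclass"].isin([2, 3]))]
young_or_first = toy[(toy["Age"] <= 20) | (toy["Pclass"] == 1)]
not_third = toy[~toy["Pclass"].isin([3])]    # ~ negates a mask

# The same filter as adults_low_class, written as a query string.
same_with_query = toy.query("Age > 20 and Pclass in [2, 3]")
print(adults_low_class)
```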


In [162]:
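# Note: 'cyl' and 'hp' are not Titanic columns. This cell (In [162]) apparently ran
# before the Titanic data was loaded, on an earlier DataFrame that had those columns.
# Mean of hp for each value of cyl:
df[['cyl', 'hp']].groupby('cyl').mean()

| cyl | hp |
|---|---|
| 4 | 82.636364 |
| 6 | 122.285714 |
| 8 | 209.214286 |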
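
The same split-apply-combine pattern works on any DataFrame. Here is a sketch on a hypothetical four-row frame with Titanic-style column names; the values are made up and are not results from the real dataset.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Survived": [1, 1, 0, 1],
    "Fare": [71.0, 53.0, 7.0, 8.0],
})

# One statistic for one column, per group.
mean_fare = toy.groupby("Pclass")["Fare"].mean()
print(mean_fare)

# Several named statistics at once.
summary = toy.groupby("Pclass").agg(
    mean_fare=("Fare", "mean"),
    survival_rate=("Survived", "mean"),
)
print(summary)
```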


## Matplotlib
----

<img src="images/matplotlib.svg" width="200" height="100">

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension, NumPy.
https://matplotlib.org/
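
A minimal plotting sketch, assuming Matplotlib is installed (`pip install matplotlib`). The `Agg` backend and the output file name `sine.png` are assumptions made so that the script runs without a display; in a notebook the figure would normally be shown inline instead.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Sine curve")
ax.legend()

fig.savefig("sine.png")   # hypothetical output file name
```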

## Scikit-learn
----

<img src="images/sklearn.png" width="200" height="100">

Scikit-learn is a free software machine learning library for Python. It features various classification, regression, and clustering algorithms, including support-vector machines, random forests, gradient boosting, k-means, and DBSCAN. https://scikit-learn.org/stable/
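
A minimal classification sketch, assuming scikit-learn is installed (`pip install scikit-learn`). The choice of the bundled iris dataset, a random forest, and the split parameters are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small dataset that ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to assess the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)   # fraction of correct test predictions
print(accuracy)
```

Most scikit-learn estimators follow the same `fit` / `predict` / `score` pattern, so swapping in another classifier usually means changing only the line that builds `model`.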