# Python Libraries

Python, like other programming languages, has a large number of additional modules, or libraries, that extend the base framework and functionality of the language.

Think of a library as a collection of functions that you can call to complete certain programming tasks without having to write your own algorithm.

We will focus mainly on the following libraries:

* NumPy: A library for working with arrays of data.

* Pandas: Provides high-performance, easy-to-use data structures and data analysis tools.

* SciPy: A library of techniques for numerical and scientific computing.

* Matplotlib: A library for making plots.

* Seaborn: A higher-level interface to Matplotlib that can be used to simplify many plotting tasks.

* Statsmodels: A library that implements many statistical techniques.
* Plotly: Plotly's Python graphing library makes interactive, publication-quality graphs.
* Dash: An open-source library for building data visualization interfaces.

# Documentation

Reliable, accessible documentation is an absolute necessity when it comes to transferring knowledge of programming languages. Fortunately, Python provides a significant amount of detailed documentation that explains the ins and outs of the language's syntax, libraries, and more.

Understanding how to read documentation is crucial for any programmer, as it will serve as a fantastic resource for learning the intricacies of Python.

Here is the link to the documentation for the Python standard library: [Python Standard Library](https://docs.python.org/3/library/index.html#library-index)


### Importing Libraries

When using Python, you should always begin your scripts by importing the libraries you will use.

The following statement imports the numpy and pandas libraries and gives them abbreviated names:

In [1]:
import numpy as np
import pandas as pd

### Utilizing Library Functions

After importing a library, its functions can then be called from your code by prepending the library name to the function name.  For example, to use the '`dot`' function from the '`numpy`' library, you would enter '`numpy.dot`'.  

> To avoid repeatedly having to type the library name in your scripts, it is conventional to define a two- or three-letter abbreviation for each library:
* e.g. '`numpy`' is usually abbreviated as '`np`'.  
    * This allows us to use '`np.dot`' instead of '`numpy.dot`'.  
* Similarly, the Pandas library is typically abbreviated as '`pd`'
    * This allows us to use '`pd.read_csv`' instead of '`pandas.read_csv`'.
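
As a small sketch of the aliasing convention above, the following checks that the abbreviated name and the full name refer to the same module, so `np.dot` and `numpy.dot` are the same function. The two vectors are made-up values used only for illustration.

```python
import numpy
import numpy as np

# Both names are bound to the same module object.
same_module = np is numpy

# Dot product of two small vectors: 1*4 + 2*5 + 3*6
result = np.dot([1, 2, 3], [4, 5, 6])

print(same_module, result)
```

The alias only changes the name you type; it does not create a second copy of the library.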

In [2]:
a = np.array([0,1,2,3,4,5,6,7,8,9,10]) 
np.mean(a)
type(a)

numpy.ndarray

In [3]:
a = complex(2, 3)
b = 3 + 4j
c = np.dot(a, b)  # (2+3j)*(3+4j) = 6 - 12 + 8j + 9j = -6+17j

In [4]:
print('a \t : ', a)
print('b \t : ', b)
print('c \t : ', c)
# Note: np.dot([2, 3j], [4, 5j]) is not a complex product; it is the dot product of two vectors, 2*4 + 3j*5j = (-7+0j).

a 	 :  (2+3j)
b 	 :  (3+4j)
c 	 :  (-6+17j)


In [5]:
d = np.dot(3j, 4j)
d

(-12+0j)

In [6]:
x = 2 + 3j
y = 4 + 5j
np.dot(x,y)

(-7+22j)

(2 + 3j)* (4 + 5j) = 8 - 15 + 10j + 12j 
= -7 + 22j 
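
To make the distinction concrete, here is a minimal sketch contrasting `np.dot` on two complex scalars (ordinary multiplication) with `np.dot` on two 1-D sequences (a sum of element-wise products). The variable names `scalar_product` and `vector_product` are illustrative, not part of the original notebook.

```python
import numpy as np

x = 2 + 3j
y = 4 + 5j

# For scalars, np.dot is plain multiplication: (2+3j)*(4+5j)
scalar_product = np.dot(x, y)

# For 1-D sequences, np.dot sums the element-wise products: 2*4 + 3j*5j
vector_product = np.dot([2, 3j], [4, 5j])

print(scalar_product)
print(vector_product)
```

The two calls look similar but answer different questions, so the list form should not be used as a way to multiply two complex numbers.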

See the [numpy.dot documentation](https://numpy.org/doc/stable/reference/generated/numpy.dot.html) for more details.

# Data Management

Data management is a crucial component of statistical analysis and data science work. The following code will show how to import data through the pandas library, view your data, and transform it.

* The main data structure that Pandas works with is called a **DataFrame** (df). It is a two-dimensional table of data in which the rows usually represent cases (e.g., participants in the Cartwheel contest) and the **columns** represent **variables**.

* Pandas also has a one-dimensional data structure called a ``Series``, which we will encounter when accessing a single column of a DataFrame.
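
The relationship between the two structures can be sketched with a tiny hand-built table. The name `toy` is hypothetical; its values are the first three `Age` and `Gender` entries of the Cartwheel data used later in these notes.

```python
import pandas as pd

# A small two-column DataFrame built from a dictionary of lists.
toy = pd.DataFrame({"Age": [56, 26, 33], "Gender": ["F", "F", "F"]})

# Selecting a single column returns a one-dimensional Series.
age_column = toy["Age"]

print(type(toy).__name__)         # DataFrame
print(type(age_column).__name__)  # Series
```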

Pandas has a variety of functions named '`read_xxx`' for reading data in different formats. For now we will focus on reading '`csv`' files, which stands for comma-separated values. Other supported file formats include Excel, JSON, and SQL, to name just a few.

There are many other options for '`read_csv`' that are very useful. For example, you would use the option `sep='\t'` instead of the default `sep=','` if the fields in your data file are delimited by tabs instead of commas. See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation of '`read_csv`'.
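
A minimal sketch of the `sep` option, assuming a tab-delimited source. To keep the example self-contained it reads from an in-memory string via `io.StringIO` rather than a real file; the column names and values are invented for illustration.

```python
import io

import pandas as pd

# Tab-delimited text standing in for the contents of a .tsv file.
tsv_text = "name\tscore\nana\t7\nluis\t9\n"

# sep="\t" tells read_csv to split fields on tabs instead of commas.
df_tab = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_tab)
```

With a real file you would pass its path or URL as the first argument in place of the `StringIO` object.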

### Importing Data

In [7]:
datafile = "https://raw.githubusercontent.com/tec03/Datasets/main/datasets/Cartwheeldata.csv" # URL of the source file

df = pd.read_csv(datafile) # Read the .csv file and store it as a pandas Data Frame

type(df) # Output object type

pandas.core.frame.DataFrame

### Viewing Data

We can view our Data Frame by calling the `head()` method:

In [8]:
df.head() 

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,0
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4


In [9]:
df.head(3)

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,0


The head() function shows the first 5 rows of our Data Frame by default; passing a number, as in head(3), shows that many rows instead. If we wanted to show the entire Data Frame we would simply write the following:

In [10]:
df

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,0
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3
6,7,28,M,2,N,0,75.0,76.0,107,Y,1,10
7,8,22,F,1,N,0,65.0,62.0,98,Y,1,9
8,9,29,M,2,Y,1,74.0,73.0,106,N,0,5
9,10,33,F,1,Y,1,63.0,60.0,65,Y,1,8


As you can see, we have a 2-Dimensional object where each row is an independent observation of our cartwheel data.
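
As a quick, self-contained sketch of inspecting a table's size and its first and last rows, the following builds a small frame from the first five rows of three columns shown above. The name `toy` is hypothetical, and `tail()` is a standard pandas method not otherwise used in these notes.

```python
import pandas as pd

# First five rows of three columns, copied from the output above.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [56, 26, 33, 39, 27],
    "Gender": ["F", "F", "F", "F", "M"],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape

first_two = toy.head(2)  # first 2 rows
last_two = toy.tail(2)   # last 2 rows

print(n_rows, n_cols)
print(first_two)
print(last_two)
```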

To gather more information regarding the data, we can start by viewing the column names in a few different ways:

In [11]:
print(*df.columns, 
      sep = '\n')

df.columns

ID
Age
Gender
GenderGroup
Glasses
GlassesGroup
Height
Wingspan
CWDistance
Complete
CompleteGroup
Score


Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

In [12]:
for i in df.columns:
    print(i)

ID
Age
Gender
GenderGroup
Glasses
GlassesGroup
Height
Wingspan
CWDistance
Complete
CompleteGroup
Score


In [13]:
c_namesA = df.columns
print(*c_namesA, sep = ', ')

ID, Age, Gender, GenderGroup, Glasses, GlassesGroup, Height, Wingspan, CWDistance, Complete, CompleteGroup, Score


The column names are saved in the variable `c_namesA` and printed separated by `, `.

In [14]:
c_namesB = list(df.columns)
c_namesB.sort()
print(*c_namesB, sep = '; ')

Age; CWDistance; Complete; CompleteGroup; Gender; GenderGroup; Glasses; GlassesGroup; Height; ID; Score; Wingspan


The column names are saved in the variable `c_namesB` as a list, sorted alphabetically, and then printed separated by `; `.
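
An alternative sketch, assuming you want a sorted copy without modifying a list in place: the built-in `sorted()` returns a new list and leaves the original column order untouched. The four-column frame `toy` is a hypothetical stand-in for the full data.

```python
import pandas as pd

# An empty frame with a few of the column names from the data set.
toy = pd.DataFrame(columns=["ID", "Age", "Gender", "Score"])

# sorted() returns a new, alphabetically ordered list.
names_sorted = sorted(toy.columns)

print(*names_sorted, sep="; ")
print(list(toy.columns))  # original order is unchanged
```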

In [15]:
res = "\n".join("{} \t\t {}".format(x, y) for x, y in zip(c_namesA, c_namesB))
print(res)

ID 		 Age
Age 		 CWDistance
Gender 		 Complete
GenderGroup 		 CompleteGroup
Glasses 		 Gender
GlassesGroup 		 GenderGroup
Height 		 Glasses
Wingspan 		 GlassesGroup
CWDistance 		 Height
Complete 		 ID
CompleteGroup 		 Score
Score 		 Wingspan


In [16]:
type(res)

str

Let's say we would like to slice our data frame and select only specific portions of our data.

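There are three different ways of doing so.

1. .loc()
2. .iloc()
3. .ix() (deprecated, and removed in pandas 1.0)

We will cover the .loc() and .iloc() indexers. Although written here with parentheses, both are used with square brackets.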

## .loc()        

### df.loc[row_start : row_end, column_start:column_end]


.loc() takes two selectors separated by ','. Each can be a single label, a list of labels, or a range (slice). The first selects rows and the second selects columns.

For columns, use *column names* only. **No position indices.**

In [17]:
df.columns

Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

In [18]:
df.loc[:,"CWDistance"]

0      79
1      70
2      85
3      87
4      72
5      81
6     107
7      98
8     106
9      65
10     96
11     79
12     92
13     66
14     72
15    115
16     90
17     74
18     64
19     85
20     66
21    101
22     82
23     63
24     67
Name: CWDistance, dtype: int64

In [19]:
df.iloc[:, 8:]

Unnamed: 0,CWDistance,Complete,CompleteGroup,Score
0,79,Y,1,7
1,70,Y,1,8
2,85,Y,1,0
3,87,Y,1,10
4,72,N,0,4
5,81,N,0,3
6,107,Y,1,10
7,98,Y,1,9
8,106,N,0,5
9,65,Y,1,8


In [20]:
df.iloc[:, -4:]

Unnamed: 0,CWDistance,Complete,CompleteGroup,Score
0,79,Y,1,7
1,70,Y,1,8
2,85,Y,1,0
3,87,Y,1,10
4,72,N,0,4
5,81,N,0,3
6,107,Y,1,10
7,98,Y,1,9
8,106,N,0,5
9,65,Y,1,8


In [21]:
df.iloc[:, -4:-2]

Unnamed: 0,CWDistance,Complete
0,79,Y
1,70,Y
2,85,Y
3,87,Y
4,72,N
5,81,N
6,107,Y
7,98,Y
8,106,N
9,65,Y


In [22]:
df.loc[:,['Age', 'Gender', 'GenderGroup']]

Unnamed: 0,Age,Gender,GenderGroup
0,56,F,1
1,26,F,1
2,33,F,1
3,39,F,1
4,27,M,2
5,24,M,2
6,28,M,2
7,22,F,1
8,29,M,2
9,33,F,1


In [23]:
df.loc[:9, ["CWDistance", "Height", "Wingspan"]]

Unnamed: 0,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


In [24]:
# Select range of rows for all columns
df.loc[5:10]

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3
6,7,28,M,2,N,0,75.0,76.0,107,Y,1,10
7,8,22,F,1,N,0,65.0,62.0,98,Y,1,9
8,9,29,M,2,Y,1,74.0,73.0,106,N,0,5
9,10,33,F,1,Y,1,63.0,60.0,65,Y,1,8
10,11,30,M,2,Y,1,69.5,66.0,96,Y,1,6


The .loc() function takes the row labels and, optionally, the column names you wish to observe. If the columns are omitted, as in df.loc[5:10], all columns are returned.

In df.loc[**:**,**"CWDistance"**], **:** specifies all rows, and our column is **CWDistance**.

Now, let's say we only want to return the first 10 observations. Because .loc() slices include the end label, `:9` returns rows 0 through 9:

In [25]:
df.loc[:9, "CWDistance"]

0     79
1     70
2     85
3     87
4     72
5     81
6    107
7     98
8    106
9     65
Name: CWDistance, dtype: int64
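
Because the Cartwheel data uses the default integer index, row labels and row positions happen to coincide. The sketch below uses made-up string labels (`p1` to `p4`, purely hypothetical) to show that `.loc` really selects by label and includes both ends of a label slice.

```python
import pandas as pd

# Values are the first four CWDistance and Height entries; the index labels are invented.
toy = pd.DataFrame(
    {"CWDistance": [79, 70, 85, 87], "Height": [62.0, 62.0, 66.0, 64.0]},
    index=["p1", "p2", "p3", "p4"],
)

# Rows labelled p2 through p3 (both included), one named column.
subset = toy.loc["p2":"p3", ["CWDistance"]]

print(subset)
```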

## .iloc()

### df.iloc[row_start : row_end, column_start : column_end]

.iloc() uses integer-position-based slicing, whereas .loc() uses labels/column names. Unlike .loc(), .iloc() slices exclude the end position.

Use *column positions* only. **No names.**



Here are some examples:

In [26]:
df.iloc[:4]

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,0
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10


In [27]:
df.iloc[1:5, 2:4]

Unnamed: 0,Gender,GenderGroup
1,F,1
2,F,1
3,F,1
4,M,2


In [28]:
# df.iloc[1:5, ["Gender", "GenderGroup"]] raises an error: .iloc does not accept column names.
df.iloc[1:5, ]

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,0
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
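
The difference in slice endpoints between the two indexers is easy to miss, so here is a minimal side-by-side sketch on a small frame (the name `toy` is hypothetical; the values are the first five `ID` and `Age` entries from the data).

```python
import pandas as pd

toy = pd.DataFrame({"ID": [1, 2, 3, 4, 5], "Age": [56, 26, 33, 39, 27]})

# Label-based: the end label 3 is included, so rows 1, 2 and 3 are returned.
by_label = toy.loc[1:3]

# Position-based: the end position 3 is excluded, so rows 1 and 2 are returned.
by_position = toy.iloc[1:3]

print(len(by_label), len(by_position))
```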


We can view the data types of our data frame columns by calling .dtypes on our data frame:

In [29]:
df.dtypes

ID                 int64
Age                int64
Gender            object
GenderGroup        int64
Glasses           object
GlassesGroup       int64
Height           float64
Wingspan         float64
CWDistance         int64
Complete          object
CompleteGroup      int64
Score              int64
dtype: object

The output indicates we have integers, floats, and objects within our Data Frame.
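
As a hedged follow-up sketch, data types can also be used to pick columns. The example below uses `select_dtypes`, a standard pandas method not used elsewhere in these notes, on a small hypothetical frame `toy` built from the first three rows of three columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [56, 26, 33],
    "Gender": ["F", "F", "F"],
    "Height": [62.0, 62.0, 66.0],
})

print(toy.dtypes)

# Keep only the numeric (integer and float) columns.
numeric_cols = list(toy.select_dtypes(include="number").columns)

print(numeric_cols)
```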

We may also want to observe the unique values within a specific column. Let's do this for Gender:

In [30]:
df.Gender.unique()

array(['F', 'M'], dtype=object)

In [31]:
df.Height.unique()

array([62.  , 66.  , 64.  , 73.  , 75.  , 65.  , 74.  , 63.  , 69.5 ,
       62.75, 61.5 , 71.  , 70.  , 68.  , 69.  ])
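
A small sketch of what `unique()` returns, using the first seven `Height` values from the data. It also shows `nunique()`, a standard pandas method not used above, which returns only the count of distinct values.

```python
import pandas as pd

# First seven Height values from the Cartwheel data.
heights = pd.Series([62.0, 62.0, 66.0, 64.0, 73.0, 75.0, 75.0])

# Distinct values, in order of first appearance.
distinct = heights.unique()

# Number of distinct values.
n_distinct = heights.nunique()

print(distinct, n_distinct)
```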

Let's explore `GenderGroup` as well:

In [32]:
df.GenderGroup.unique()

array([1, 2])

It seems that these fields may serve the same purpose, which is to specify male vs. female.

Let's check this quickly by observing only these two columns:

In [33]:
df.loc[:,
       ["Gender", "GenderGroup"]
      ]

Unnamed: 0,Gender,GenderGroup
0,F,1
1,F,1
2,F,1
3,F,1
4,M,2
5,M,2
6,M,2
7,F,1
8,M,2
9,F,1


From eyeballing the output, it seems to check out.  

We can streamline this by utilizing the `groupby()` and `size()` functions.

In [34]:
df.groupby(['Gender','GenderGroup']).size()

Gender  GenderGroup
F       1              12
M       2              13
dtype: int64

This output indicates that we have two types of combinations. 

* Case 1: Gender = F & GenderGroup = 1
* Case 2: Gender = M & GenderGroup = 2

This validates our initial assumption that these two fields essentially portray the same information.
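
The same check can be expressed as a yes/no test instead of reading a table by eye. This is a sketch under the assumption that "same information" means each `Gender` maps to exactly one `GenderGroup` and vice versa; the frame `toy` and its five rows are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "F"],
    "GenderGroup": [1, 1, 2, 2, 1],
})

# How many distinct codes appear for each gender, and the reverse.
codes_per_gender = toy.groupby("Gender")["GenderGroup"].nunique()
genders_per_code = toy.groupby("GenderGroup")["Gender"].nunique()

# True only if every gender has one code and every code has one gender.
is_one_to_one = bool((codes_per_gender == 1).all() and (genders_per_code == 1).all())

print(is_one_to_one)
```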

In [35]:
df[['Gender','GenderGroup']].value_counts()

Gender  GenderGroup
M       2              13
F       1              12
dtype: int64

In [36]:
df.columns

Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

In [37]:
df[['Gender','GenderGroup', 'Glasses',]].value_counts()

Gender  GenderGroup  Glasses
F       1            Y          8
M       2            N          7
                     Y          6
F       1            N          4
dtype: int64

In [38]:
df.groupby(['Gender','GenderGroup','Glasses']).size()

Gender  GenderGroup  Glasses
F       1            N          4
                     Y          8
M       2            N          7
                     Y          6
dtype: int64
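
The two approaches above return the same counts in a different order: `groupby(...).size()` is ordered by the group keys, while `value_counts()` is ordered by count, largest first. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["F", "M", "M", "F", "M"]})

# Ordered by key: F, then M.
by_key = toy.groupby("Gender").size()

# Ordered by count, descending: M (3), then F (2).
by_count = toy["Gender"].value_counts()

print(by_key)
print(by_count)
```

Sorting the `value_counts()` result with `sort_index()` gives the same ordering as `groupby(...).size()`.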
