# Python
Python is a general-purpose programming language. In these workshops we will use Python 3.8 or higher. Python is an interpreted language, like JavaScript and unlike Java or C, which must be compiled before they run. As a consequence, some errors that a compiler would catch ahead of time go unnoticed until the offending line runs, where they raise an exception.

In this notebook we are going to see some very basic examples of how to use Python and some of its libraries. For a detailed view of the language, see the official tutorial: https://docs.python.org/3/tutorial/

## Variables and types
Python is a strongly typed language. It has four basic types: integers (`int`), floating-point numbers (`float`), booleans (`bool`) and text strings (`str`). Python is also dynamically typed: when we define a variable we do not declare its type, which is taken from the assigned value.

In [None]:
# This is a comment
# Comments start with the character #

# This frame is a cell which can be executed individually.
# To execute a cell press Ctrl+Enter or press the run button in the tools menu

print("Hello World!")

In [None]:
# An integer
myInt = 10

# A float value
myFloat = 3.14

# A boolean
myBool = True

# A string; you can use single quotes ' or double quotes "
myStr = "example1"
myStr2 = 'example2'


To write to the standard output, in other words to print something, we will use the function `print()`.

In [None]:
print(myInt)
print(myFloat)
print(myBool)
print(myStr)
print(myStr2)

# As Python is strongly typed, values are not converted to another type automatically. For example, to concatenate
# an int with a str, we have to convert the int into a str first
print(myStr + " " + str(myInt))
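
Because the type belongs to the value rather than to the variable name, the same name can be rebound to values of different types, and the built-in `type()` function reports the current one. A minimal sketch of this, together with the explicit conversions between basic types (the names `value`, `as_int`, `as_float` and `as_text` are only illustrative):

```python
# The type is attached to the value, not to the variable name
value = 10
print(type(value))    # <class 'int'>

value = "ten"         # the same name now refers to a str
print(type(value))    # <class 'str'>

# Explicit conversions between the basic types
as_int = int("42")        # str -> int
as_float = float("3.5")   # str -> float
as_text = str(3.14)       # float -> str
print(as_int + 1, as_float * 2, as_text + "!")
```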

## Basic operations

In [None]:
# Add
print(3 + 5)

# Subtract
print(3 - 5)

# Multiply
print(3 * 5)

# Divide
print(3 / 5)

# Power 
print(3 ** 5)

## Lists

In [None]:
myList = [1, 2, 3, True, "APSV"]

# Access an element in certain position
print(myList[0])

# We can access elements from the end by using negative indexes. -1 is the last value, -2 the previous one, etc
print(myList[-1])

# We can obtain a sublist using the syntax myList[start:end]; the element at position end is not included
# If start is omitted the sublist begins at the start of the list; if end is omitted it runs to the end of the list
print(myList[0:2])
print(myList[-3:-1])
print(myList[:2])
print(myList[2:])

In [None]:
# To modify a value
myList[0] = 4
print(myList)

# To add elements at the end of the list
myList.append("CRIS")
print(myList)

# To add elements into a certain position
myList.insert(1,23)
print(myList)

# To remove the last element
lastElement = myList.pop()
print(myList)

# To remove the element at a certain position
del myList[1]
print(myList)

# To remove an element by value
myList.remove(3)
print(myList)

# To obtain the length of any collection
print(len(myList))

## Dictionaries
Dictionaries are collections of key-value pairs. These collections can be accessed like a list but using the keys instead of the indexes.

The values can have different types; it is even common to have dictionaries inside dictionaries (e.g. when working with JSON).

In [None]:
# Dictionaries are marked with {} characters
# The syntax to set a key-value pair is "key": "value"
myDict = { "key": "value", "APSV": 23, "arr" : [1,2,3] }
print(myDict)

# Access by key
print(myDict["arr"])
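
The nested case mentioned above can be sketched as follows. The `student` dictionary and its keys are made up for illustration; the example also shows how to add a key, how to read a key that may be missing, and how to iterate over the pairs.

```python
# A dictionary with another dictionary nested inside, as in a JSON document
student = {
    "name": "Ada",
    "grades": {"math": 9.5, "physics": 8.0},
}

# Chain the keys to reach a nested value
print(student["grades"]["math"])

# Assigning to a new key adds it; assigning to an existing key overwrites it
student["year"] = 2

# get() returns a default instead of raising KeyError when the key is missing
print(student.get("email", "unknown"))

# items() yields (key, value) pairs
for key, value in student.items():
    print(key, "->", value)
```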

## Boolean operations

In [None]:
# AND
print(myInt < 5 and myBool)

# OR
print(myInt > 5 or myBool)

# Negate
print(not myBool)

# Check element in list
print(4 in myList)

## Flow control
In Python, code blocks are defined by their indentation level, unlike other languages that surround blocks with `{}`.

In [None]:
# Conditional
if myInt > 5:
    print("myInt is greater than 5")

In [None]:
# Conditional with else
if myInt < 5:
    print("myInt is less than 5")
else:
    print("myInt is greater than or equal to 5")

In [None]:
# Chained conditionals with elif
if myInt < 5:
    print("myInt is less than 5")
elif myInt < 15:
    print("myInt is less than 15")
else:
    print("myInt is greater than or equal to 15")

In [None]:
# For loop to iterate over list elements
for i in myList:
    print(i)

In [None]:
# For loop to iterate over values in a range
for i in range(len(myList)):
    print(str(i) + " - " + str(myList[i]))
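
When both the position and the element are needed, the built-in `enumerate()` is a common alternative to `range(len(...))`. A small sketch with an illustrative list named `letters`:

```python
letters = ["a", "b", "c"]

# enumerate() yields (index, element) pairs, so no manual indexing is needed
for i, letter in enumerate(letters):
    print(str(i) + " - " + letter)

# The pairs can also be collected into a list
pairs = list(enumerate(letters))
print(pairs)
```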

In [None]:
# While loop
a = 0
while a < 10:
    print(a)
    a += 1

In [None]:
# Break and continue
a = 0
while True:
    a += 1
    if a > 10:
        break
    if a % 2 == 0:
        continue
    print(a, "is odd")

## Methods
To define a method (function) in Python we use the syntax `def methodName(arg1, arg2, arg3="default value"):`. Parameters with a default value must come after the parameters without one.

In [None]:
def sumValues(a, b=1):
    return a + b

print(sumValues(2,3))
print(sumValues(2))
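
Arguments can also be passed by name, which allows skipping a default parameter or changing the order. The function `describe_box` below is a hypothetical example written only to illustrate this:

```python
def describe_box(width, height=1, unit="cm"):
    # Parameters with default values come after those without
    return str(width) + "x" + str(height) + " " + unit

print(describe_box(3))                    # 3x1 cm
print(describe_box(3, 2))                 # 3x2 cm
print(describe_box(3, unit="m"))          # 3x1 m
print(describe_box(height=5, width=4))    # 4x5 cm
```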

## Import modules
In Python, external dependencies are usually called modules. To import them we use `import moduleName`.

If you want to use additional modules you may need to install them first. Python has a command named `pip` to install external libraries (similar to `npm` in JavaScript); in an Anaconda environment you can use `conda` instead, running it in the Anaconda terminal:

`conda install [name of library]`

In [None]:
import math
print(math.sqrt(2))
import random
print(random.random())

## Errors
Until now, all the code that we have executed has run correctly, but when we do more complex things we can get execution errors. Now we are going to run code with errors to see what kinds of errors we can find and what they look like.

In [None]:
# NameError
# This error is raised when we use a name that has not been defined, often because of a typo
print(var33)

In [None]:
# TypeError
# Raised when an operation is applied to values of incompatible types
print("hello " + 3)

In [None]:
# ZeroDivisionError
print(1 / 0)

In [None]:
# IndexError
# This happens when we try to access an index that is out of bounds
a = [1, 2]
print(a[4])
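
These exceptions stop the program unless they are handled. A minimal sketch of handling them with `try`/`except`, assuming we simply want to recover and continue (the helper `safe_divide` and the list `values` are illustrative, not part of the workshop code):

```python
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        # Only this error type is handled; any other exception still propagates
        return None

print(safe_divide(1, 2))
print(safe_divide(1, 0))

values = [1, 2]
try:
    print(values[4])
except IndexError as error:
    # The exception object carries a message describing the problem
    message = str(error)
    print("Caught:", message)
```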

## Classes
Like Java and other object-oriented languages, Python supports classes. A class can be instantiated to create objects of that class.

https://docs.python.org/3/tutorial/classes.html

In [None]:
class Complex:
    # The __init__ method is the constructor of the class
    def __init__(self, realpart, imagpart):
        self.r = realpart
        self.i = imagpart
    # Every instance method takes the object itself as its first parameter, named self by convention
    def mod(self):
        return math.sqrt(self.r**2+self.i**2)

c = Complex(1,2)
print(c.r,c.i)
print(c.mod())

## Problems
### Problem 1
Design a method to calculate the cosine distance between two lists of floats ``x`` and ``y``. The cosine similarity is the cosine of the angle between two vectors (https://en.wikipedia.org/wiki/Cosine_similarity); the cosine distance is one minus that similarity. In the formula below, $\mathbf{A}$ and $\mathbf{B}$ stand for ``x`` and ``y``.

$D =1-\cos(\theta) = 1-{\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = 1-\frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} } $

In [None]:
# Use this cell to write your code
def cos_distance( x, y ):
    return

In [None]:
# Try it
print(cos_distance([1,2,3], [3,2,1]))

### Problem 2
Design a method that creates a dictionary with ``n`` vectors identified with keys ``v1, v2, ... vn``. The vectors should have dimension ``m`` and their components should be random numbers between 0 and 1 (these numbers can be generated with ``random.random()``)

In [None]:
# Use this cell to write your code
def generate_vectors( n, m ):
    return 

In [None]:
# Try it
print(generate_vectors(5, 3))

### Problem 3
Design a method that receives a dictionary of vectors and returns a matrix (list of lists) with the distance between each pair of vectors.

```python
vectors = {"v1": [], "v2": [], "v3": []}

distances = [[cos_distance(v1,v1), cos_distance(v1,v2), cos_distance(v1,v3)],
             [cos_distance(v2,v1), cos_distance(v2,v2), cos_distance(v2,v3)],
             [cos_distance(v3,v1), cos_distance(v3,v2), cos_distance(v3,v3)]]
```

In [None]:
# Use this cell to write your code
def distance_matrix( vectors ):
    return 

In [None]:
# Try it
print(distance_matrix( {"v1": [1,2,3], "v2": [2,3,1], "v3": [3,1,2]} ))

### Problem 4
Design a method that receives a matrix of numbers and returns the minimum value.

In [None]:
# Use this cell to write your code
def get_min( distances ):
    return

In [None]:
# Try it
print(get_min([[.1,.2,.3],[.4,.5,.6],[.7,.8,.9]]))

In [None]:
# All together!

# Generate 10 vectors of dimension 10
vectors = generate_vectors(10, 10)

# Calculate the distance between all vectors
distances = distance_matrix(vectors)

# Get the minimum distance
min_dis = get_min(distances)

print(min_dis)

___
# Pandas
## Python Data Analysis Library
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
https://pandas.pydata.org/

The basic data structure of pandas is the DataFrame. A dataframe is a collection of tabular data, similar to a SQL table. Every dataframe has an index that labels its rows.

The pandas introductory guide covers more than we are going to use here, but it is worth a quick read: https://pandas.pydata.org/docs/getting_started/overview.html

__NumPy__ (http://www.numpy.org/) is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

__Important functions__
Some cells contain methods or functions that are going to be used in the next workshops. Those cells are marked with a comment like this 💥💥💥

In [None]:
import pandas as pd
import numpy as np

### Load data
We can load data into dataframes from different sources
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In [None]:
# 💥💥💥
trucks_data = [['9073 YGP', 130, 150, 200,  300],
    ['3881 KCC', 130, 200, 245,  400],
    ['1845 GDS', 130, 245, 250,  600],
    ['8725 MHH', 130, 270, 245, 1300]]
trucks = pd.DataFrame(trucks_data, columns = ['plate', 'max_cargo', 'height', 'width', 'length'])
customers = pd.read_json("data/customers.json")
orders = pd.read_csv("data/orders.csv")
packages = pd.read_csv("data/packages.zip")

### Accessing the dataframe
When we evaluate a DataFrame variable, its content is displayed as a table.

In [None]:
trucks

In [None]:
customers

In [None]:
orders

In [None]:
packages

### Index
By default the index of a dataframe is a sequence of incrementing integers starting at 0.

In [None]:
# Slicing selects rows by position; the end is excluded, so this returns the rows at positions 1 and 2
orders[1:3]

In [None]:
# To get a single row by its position we use the indexer iloc
orders.iloc[1]

In [None]:
# Getting the index of the dataframe (an Index object; list(orders.index) converts it to a plain list)
orders.index
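
The index does not have to be the default integer sequence. The sketch below builds a small dataframe with made-up values (the names `fleet` and `by_plate` are illustrative), uses one of its columns as index with `set_index`, and contrasts label-based access (`loc`) with position-based access (`iloc`).

```python
import pandas as pd

fleet = pd.DataFrame({
    "plate": ["9073 YGP", "3881 KCC", "1845 GDS"],
    "length": [300, 400, 600],
})

# Default index: integers 0, 1, 2
print(list(fleet.index))

# set_index returns a new dataframe that uses the 'plate' column as index
by_plate = fleet.set_index("plate")

# loc selects by index label, iloc selects by integer position
print(by_plate.loc["3881 KCC", "length"])
print(by_plate.iloc[0]["length"])
```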

### Access to data column
We can retrieve the data of a column as if the dataframe were a dictionary.

In [None]:
# 💥💥💥
# Get a single column
customers["province"]
# Or
customers.province

In [None]:
# Get multiple columns (this returns a dataframe with the selected columns)
customers[["lat","lng"]]

In [None]:
# 💥💥💥
# Get unique values from a column
customers["province"].unique()

### Describe data
When we evaluate a dataframe, pandas will show a table with its contents. If there are many rows, it only shows the beginning and the end.

In [None]:
packages

We can use the method `describe` to know more about one column or the whole dataset. If the column has numerical values it will calculate some statistical measures like the mean, the count, the max or the min. On the other hand, if the values are strings it will count the unique values and return the most frequent one.

In [None]:
# 💥💥💥
packages.describe()

In [None]:
packages["height"].describe()

In [None]:
# 💥💥💥
# List columns and their types
customers.dtypes

In [None]:
# 💥💥💥
# info() prints information about the dataframe including the index dtype and columns, non-null values and memory usage 
orders.info()

### Searching data
We will often need to get the rows that match some condition. We will do that with the `loc` indexer.

In [None]:
# 💥💥💥
# Searching all packages with a height smaller than 100cm
packages.loc[packages["height"]<100]

In [None]:
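# Searching for customers whose name contains the string 'S.A.'
# regex=False matches the text literally; by default the pattern is a regular expression, where '.' matches any character
customers.loc[customers["name"].str.contains("S.A.", regex=False)]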

### Filtering
Sometimes we will need to filter our data: discard empty values, keep only values in a range, etc.

In [None]:
# We can perform boolean operations on columns; the result is a column of boolean values
# These boolean columns are called masks
packages["width"]<120

In [None]:
# We will use those columns to filter our data
packages_filtered = packages[packages["width"]<120]

In [None]:
# To filter by the contents of a string column
customers[customers.province.str.contains("ia")]
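
Masks can be combined to express several conditions at once. The following sketch uses a small made-up dataframe named `boxes` (not one of the workshop files) to show the element-wise operators `&`, `|` and `~`:

```python
import pandas as pd

boxes = pd.DataFrame({
    "width": [50, 130, 90, 200],
    "height": [80, 60, 150, 40],
})

# & is "and", | is "or", ~ is "not"; each condition needs its own parentheses
small = boxes[(boxes["width"] < 120) & (boxes["height"] < 100)]
wide_or_tall = boxes[(boxes["width"] >= 120) | (boxes["height"] >= 100)]
not_wide = boxes[~(boxes["width"] >= 120)]

print(small)
print(wide_or_tall)
print(not_wide)
```

The Python keywords `and`, `or` and `not` do not work on whole columns, which is why the symbol operators are used here.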

### Modifying data in a column

In [None]:
# 💥💥💥
# We can apply a method to each value in a column
def to_inches(x):
    return x * 0.393701
print("Original")
print(packages["width"])
print("Modified")
# Call apply with the name of the method we want to execute over each value
print(packages["width"].apply(to_inches))

In [None]:
# ADVANCED
# We can do the same by using lambdas
print(packages["width"].apply(lambda x: x * 0.393701))

### Creating new columns

In [None]:
def is_big(x):
    if x > 150:
        return True
    else:
        return False
packages["big"] = packages.height.apply(is_big)

In [None]:
# 💥💥💥
# We can use the apply method over each row with the parameter "axis=1"
def volume(x):
    return x.height * x.length * x.width
packages["volume"] = packages.apply(volume, axis=1)
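
For simple arithmetic and comparisons, `apply` is not strictly needed: operators work on whole columns at once, which is usually shorter and faster. A sketch of the same three transformations on a small made-up dataframe (`boxes` and its values are illustrative):

```python
import pandas as pd

boxes = pd.DataFrame({
    "width": [100, 200],
    "height": [50, 160],
    "length": [10, 20],
})

# Arithmetic on a column is applied to every value at once
boxes["width_in"] = boxes["width"] * 0.393701

# A comparison already produces booleans, so no if/else is needed
boxes["big"] = boxes["height"] > 150

# Columns are combined element by element, row by row
boxes["volume"] = boxes["height"] * boxes["length"] * boxes["width"]
print(boxes)
```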

### Rename columns

In [None]:
# The parameter columns is a dictionary with the old names as keys and new names as values

# Most of the methods that modify dataframes do not change the dataframe itself;
# instead they return a copy of it with the modification applied
orders.rename(columns={"VAT_number":"client_VAT"})
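
Since `rename` returns a copy, the result has to be assigned to a variable to keep the change. A short sketch with a made-up dataframe named `sample`:

```python
import pandas as pd

sample = pd.DataFrame({"VAT_number": ["A1", "B2"], "order_id": [1, 2]})

# rename returns a new dataframe; 'sample' keeps its original column names
renamed = sample.rename(columns={"VAT_number": "client_VAT"})
print(list(sample.columns))
print(list(renamed.columns))

# To keep the change, assign the result back to the variable
sample = sample.rename(columns={"VAT_number": "client_VAT"})
print(list(sample.columns))
```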

### Group by
We can run group-by operations similar to SQL's `GROUP BY`.

https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

In [None]:
# 💥💥💥
# Group by order id and count the number of non-null values in each column
packages.groupby("order_id").count()
# There are many predefined aggregating functions like: first,last, median, sum, mean, max, min, etc
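
As a sketch of those other aggregating functions, the example below groups a small made-up dataframe (`parcels`, with illustrative values) and computes a sum per group, then several aggregations at once with `agg`:

```python
import pandas as pd

parcels = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "weight": [10, 20, 5, 5, 30],
})

# Total weight per order
total_weight = parcels.groupby("order_id")["weight"].sum()
print(total_weight)

# Several aggregations at once with agg()
summary = parcels.groupby("order_id")["weight"].agg(["count", "mean", "max"])
print(summary)
```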

### Join dataframes
There are methods to perform joins of two or more dataframes. The basic one, and the only one we need in these workshops, is the method `merge`.

Just for curiosity, the guide to all join types can be glanced through: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [None]:
# 💥💥💥
# Merge both dataframes
data = orders.merge(packages, on="order_id").merge(customers, on="VAT_number")

data
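
By default `merge` performs an inner join, so rows without a match in the other dataframe are dropped. The sketch below uses two small made-up dataframes (`sample_orders` and `sample_customers`, not the workshop files) to compare it with a left join, selected through the `how` parameter:

```python
import pandas as pd

sample_orders = pd.DataFrame({"order_id": [1, 2, 3], "VAT_number": ["A1", "B2", "C3"]})
sample_customers = pd.DataFrame({"VAT_number": ["A1", "B2"], "name": ["Acme", "Bolt"]})

# Default (how="inner"): order 3 is dropped because customer C3 is unknown
inner = sample_orders.merge(sample_customers, on="VAT_number")

# how="left": every order is kept and the missing customer data becomes NaN
left = sample_orders.merge(sample_customers, on="VAT_number", how="left")

print(inner)
print(left)
```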

### Save data
Saving a dataframe is quite similar to loading data: we use the methods ``dataframe.to_csv(filename)`` or ``dataframe.to_json(filename)``.

In [None]:
data.to_json("data/data.json")
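
A round trip with CSV can be sketched as follows. The dataframe `sample` and the file name are made up, and the file is written to a temporary folder so that no workshop data is touched:

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({"plate": ["9073 YGP", "3881 KCC"], "length": [300, 400]})

# Write to a temporary folder so that no real data file is touched
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# index=False leaves the index out of the file
sample.to_csv(path, index=False)

# Reading the file back gives the same columns and values
restored = pd.read_csv(path)
print(restored)
```

Without `index=False`, the index is written as an extra unnamed column, which then shows up as a regular column when the file is read back.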

## Problems
### Problem 1
Calculate the number of trucks and customers

### Problem 2
Count the number of different provinces that appear in the customers dataframe

### Problem 3
Plot the top 10 heaviest packages (tip: you can use the method `sort_values` https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)

___
# Matplotlib and Seaborn

__Matplotlib__ (https://matplotlib.org/) is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

__Seaborn__ (https://seaborn.pydata.org/) is a Python library built on top of Matplotlib which offers an easy interface to create plots using dataframes.

In [None]:
# To import these libraries we use these lines
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# 💥💥💥
# To create a new figure, we will always follow these steps
# Note that the method returns two values, fig and ax; returning several values at once is common in Python
fig, ax = plt.subplots(nrows=2,ncols=2,figsize=(7,5))
# The three parameters nrows, ncols and figsize are optional
# We can define a matrix of subplots by changing the values of nrows and ncols
# We can modify the size of the figure (width and height, in inches) with figsize
# If we don't specify this parameter, matplotlib uses its default figure size

# Here we will define the charts, labels, grid, etc

# Finally we should tell matplotlib to show the figure we have defined
plt.show()

In [None]:
# Variable ax is a numpy array with all the subplots that we define
# We can access each subplot by its coordinates
ax[0,0] # This is the same as ax[0][0]

In [None]:
fig, ax = plt.subplots(nrows=2,ncols=2,figsize=(7,5))

# Sample function
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

# To plot a line in a certain subplot we can use
ax[0,0].plot(t,s) # t are the horizontal values and s the vertical ones
ax[1,0].plot(s,'g') # If we define only one series it will be taken as the vertical values
# We can define the color of a plot after the values of the line

plt.show()

## Plots with Matplotlib
There is a huge number of different plots available in matplotlib. We are going to see the most basic ones.
https://matplotlib.org/api/axes_api.html#plotting

### Line plot
https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.plot.html?highlight=plot#matplotlib.axes.Axes.plot

We can pass a string after the data to modify the style of the plotted line. These strings have the form ``'[color][marker][line style]'``; for example, ``'b--'`` is a dashed blue line and ``'ro'`` is a sequence of red circles.


In [None]:
fig, ax = plt.subplots(figsize=(7,5))

# Plot
ax.plot(t,s) 
ax.plot(t,2*s,'g--') # We can plot several lines in the same chart
ax.plot(t,4*s,'r,')

plt.show()

### Bar plot
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

# Sample data about programming language popularity
x = ["Python", "Java", "Javascript", "C#", "PHP"]
y = [25.13,21.98,8.35,7.5,7.36]

# Plot
ax.bar(x,y) 

plt.show()

### Scatter plot
https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html#matplotlib.axes.Axes.scatter

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

# Sample data
x = np.arange(0, 10, .1)
y = (x+np.random.rand(100))**2

# Plot
ax.scatter(x,y) 
ax.scatter(x,-y)

plt.show()

### Histogram plot
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib.pyplot.hist

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

# Sample data
marks = [5.9, 7.9, 7.4, 6.2, 5.7, 8.3, 6  , 6.4, 8.1, 7.1, 5.7, 6.8, 6.2, 6.9, 7.2, 6.8, 7.3, 8.3, 5.3, 7.2, 5.8, 
         5.3, 6.8, 6.9, 5.5, 5.6, 7.7, 8.4, 6.5, 4.4, 6.8, 9.3, 5.7, 9.5, 7.1, 5.9, 8.2, 9  , 6.6, 6.8, 4.9, 6.9, 
         6.2, 6.8, 9.1, 5.8, 7.3, 4.7, 7.4, 3.1, 8.5, 7.9, 5.8, 7.9, 5.1, 5.2, 7.8, 6.3, 6.5, 5.3, 7.5, 6.8, 6.6, 
         6.7, 7.8, 7.6, 10 , 5.8, 8.1, 7.8, 8.5, 5.4, 8.1, 3.6, 6  , 8  , 6.1, 4.9, 6.3, 5.2, 7.3, 7  , 6.7, 5.9, 
         4.2, 5.2, 8.5, 9.2, 7.1, 8.7, 6.6, 8  , 6.9, 5  , 5.9, 8.1, 7  , 8.2, 7.7, 4.2]

# Plot
ax.hist(marks)

plt.show()

## Axes modifications and labels

If we do not configure anything, matplotlib will generate axis scales just big enough to fit the data. Sometimes, however, we may want to define a certain range (e.g. if we are plotting exam marks, we may want the horizontal axis to go from 0 to 10) or use a different precision for the ticks.

In addition, it is good practice to include labels on both axes and a title above the chart.

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

ax.hist(marks)

# We can specify the ranges of both axes
ax.set_xlim(0, 10)
ax.set_ylim(0, 30)

# We can also specify the divisions of the axes
ax.set_yticks(np.arange(0,32,2))

# Activate the grid
ax.grid()

# Set a title
ax.set_title("Histogram of last exam's marks")

# Set axes labels
ax.set_xlabel("Grade")
ax.set_ylabel("Number of students")

plt.show()

## Exporting images
We can export our images to use them elsewhere. Vector formats (such as SVG, EPS or PDF) are recommended whenever possible, because they scale without losing quality.

In [None]:
# We specify the file name; the format is inferred from its extension
# bbox_inches='tight' trims the margins around the image
fig.savefig("marksHistogram.pdf", bbox_inches='tight')
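
The same figure can be exported to several formats by changing only the file extension. A minimal sketch, assuming a throwaway figure and a temporary output folder (the file names and the `dpi` value are illustrative choices):

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4])

out_dir = tempfile.mkdtemp()
svg_path = os.path.join(out_dir, "line.svg")
png_path = os.path.join(out_dir, "line.png")

# The output format is inferred from the file extension
fig.savefig(svg_path, bbox_inches="tight")
# Raster formats accept a resolution in dots per inch
fig.savefig(png_path, dpi=150, bbox_inches="tight")

# Release the figure once it has been saved
plt.close(fig)
```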

## Plotting with Seaborn
Seaborn can be used in a similar way to matplotlib, by creating a figure and then plotting something on it.

In [None]:
# 💥💥💥
fig, ax = plt.subplots(figsize=(7,5))

# The parameter 'data' defines the dataframe used to generate the chart
# We need to indicate which columns will be used in each axis. 'x' and 'y' in this case
# To draw on our figure we set the parameter 'ax' to the axes object we have just created
sns.scatterplot(data=customers, x="lat",y="lng", ax=ax)

## Categorical plots

In [None]:
fig, ax = plt.subplots(figsize=(15,7))

# countplot displays a bar chart with the number of samples in each category
sns.countplot(data=orders, y='VAT_number', ax=ax)

## Distribution plots

In [None]:
data

In [None]:
fig, ax = plt.subplots(nrows=2, figsize=(14,12))

# histplot creates a histogram of the indicated column
sns.histplot(data=data, x="volume", ax=ax[0], bins=50)

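# boxplots summarize the distribution of a numeric column (median, quartiles and outliers) for each category
# With the 'hue' parameter, each category is split into one box per value of another column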
sns.boxplot(data=data, y="province", x="weight", ax=ax[1], hue="big")

## Relation plots

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(14,6))

# regplot draws a scatter plot together with a fitted linear regression line
sns.regplot(x="width",y="volume",data=packages, ax=ax[0])

# scatterplot can encode a third column through the marker size with the 'size' parameter
sns.scatterplot(data=packages, x="width", y="height", size="volume", ax=ax[1])

## Problems

### Problem 1
Generate a list of vectors with the method `generate_vectors` that we previously coded. Plot one chart for each vector; each of those charts should be a bar chart with the distances to the other vectors (tip: use the method `distance_matrix`)

In [None]:
# Use this cell to write your code

### Problem 2
Combine two kinds of charts (line, bars, scatter, etc) in one plot

In [None]:
# Use this cell to write your code