# D. E. U. en Inteligencia Artificial de Samsung Innovation Campus (Theory)

- Cèsar Ferri, `cferri@dsic.upv.es` (Introduction).
- Javier Palanca, `jpalanca@dsic.upv.es`, and Salvador España, `sespana@dsic.upv.es` (Numpy, Pandas).
- **`Shift` + `Tab`**: shows the help of the function, in addition to the `help()` function.
- **`Esc` -> `M`**: changes the cell to Markdown (**Y** to change to Code).
- **`Tab` MUST BE USED**.

# Chapter 1. Introduction

- 

# Chapter 3. Numpy and Pandas (Optimized Numerical Computation)

## Unit 1. Numpy Array Data Structure for Optimal Computational Performance

- Memory is **expensive**: Numpy is implemented in C for maximum **optimization**.
- Check the **NumPy Documentation**:
  - [[General Functions]](https://numpy.org/doc/stable/reference/index.html).
- Data and algorithms: **finding data** is important; lookups can be made fast, but at a **memory cost**. In addition, **allocating data** increases the difficulty of the problem.
- **Numpy**: implemented in C, uses **element-by-element operations**.
- **Array DataType**: created as `np.array([n, n, ...])`; it reserves the memory and infers that it is an *integer* array. Two arrays with identical contents are still different objects at different memory locations. Comparing two arrays with `==` returns a new array of booleans, one per element, because of **element-by-element operations**. `ndarray` means `n-dimensional array`. The default integer type depends on the platform (`int32` in the output below). Do not store dictionaries or sets inside an array. Use arrays to implement lists of lists as matrices; a matrix is traversed row by row. **Type inference** works very well to **prevent memory waste**.
- **Copying an array**: `np.copy()` duplicates the data, and its `order` parameter defines how the copy is **laid out in memory** (by rows, by columns, ...). **Making a copy uses more memory and costs more than creating a new array**, so only copy when you need to modify the data and still retain the original. Do **not use copies when you are not modifying** the object.
- **Memory usage**: in a string array every *box* is as big as the longest word, so a long list of short words that contains one very long word **wastes memory space**. The **data type can also create waste (such as saving numbers as strings)**. Doing a `reshape` can cost more than `creating` the array with the right shape in the first place.
- Function `np.array(list)`: turns a list into an array.
- Function `np.zeros(n)`, `np.ones(n)`: returns an array of `n` elements filled with zeros or with ones.
- Attribute `arr.shape`: returns a tuple with the number of elements along each dimension (rows, columns, ...).
- Attribute `arr.dtype`: returns the inferred type of the elements, or `object` when NumPy cannot map them to a native type (dictionaries, very big numbers, sets, ...).
- Function `np.arange(start, stop, step, dtype)`: similar to `range`. Difference: it **does allocate memory for every element** and accepts decimal values.
- Function `len(array)`: returns the length of the first dimension (if you were to loop over it, **how many iterations you would need**), i.e. **the number of rows**.
- Attribute `array.size`: returns the total number of elements (`rows x columns x depth`).
- Attribute `array.itemsize`: returns the size in bytes of **one element**.
- Attribute `array.nbytes`: returns the bytes occupied by the whole array (`size x itemsize`).
- Attribute `array.ndim`: returns the number of **dimensions** of the object.
- Function `ndarray.view(dtype)`: reinterprets the same bytes as if they had a different data type, **but it is not the same as actually converting the data to that type** (the result can even be meaningless).
- Function `np.linspace(start, stop, n_elements)`: returns an array of `n_elements` numbers evenly spaced on a **linear** scale.
- Function `np.geomspace(start, stop, n_elements)`: returns an array of `n_elements` numbers evenly spaced on a **logarithmic** scale (a geometric progression).
- Function `np.logspace(start, stop, n_elements)`: like `np.geomspace`, but `start` and `stop` are given as **exponents of the base (10 by default)**, so the values run from `10**start` to `10**stop`.
- Numpy array `indexing`: similar to the `List Datatype`, with slicing and all. **Be careful when manipulating arrays: unlike with the List Datatype, slicing DOES NOT CREATE A COPY**; it returns a view on the same memory. The same care applies to the `ndarray.view(dtype)` function. **Matrix manipulation**: **WILL PROBABLY AFFECT MEMORY ALLOCATION**. In addition, before deleting a big matrix, it might be worth copying the smaller matrices that share memory with it into new arrays, as **MEMORY WILL NOT BE FREED UNTIL THE WHOLE CHUNK IS COMPLETELY FREE** (check `np.copy()`).
- Function `array.item()`: returns an element of the array as a standard Python object, which gives access to an `array-object` such as a dictionary stored inside an array.
- Function `array.astype(type)`: returns a copy of the array converted to the given type.
- Function `array.reshape()`: returns an array with a different shape (a view on the same data when possible). **Does not modify the original**.
- Function `np.shares_memory(element_1, element_2)`: returns a boolean indicating whether the two arrays overlap in memory.
- Statement `del array`: deletes the variable; the memory is released only when nothing else (another name or a view) still references it.
- Function `np.append(array, values, axis)`: adds elements to an array (**does not modify the original, ASSIGN THE RESULT TO A NEW ONE**). **Beware of the dimensions of the array**. The axis selects whether the addition is done to `rows (0)` or `columns (1)`; without an axis, the result is flattened. As an array is stored first by rows and then by columns, adding a new row just appends an array at the end, whereas adding a column means adding **one value to every row**.
- Function `np.delete(array, index, axis)`: returns a new array without the elements at the given index. **Omitting `axis` on a matrix returns a flattened array**.
- `Numpy operators`: work **element-to-element**. Arithmetic operators such as `+`, `-`, `*`, `/`, `**` return an array of numbers, and comparison operators return an array of `booleans`.
- Function `np.repeat(array, n)`: repeats each element `n` times.
- Function `np.tile(array, n)`: repeats the whole array `n` times.
- Function `np.sqrt(array)`: returns the square root of each element.
- Function `np.random.seed(n)`: sets the seed of the pseudo-random number generator.
- Function `np.random.randint(low, high, size)`: returns random integers from `low` (inclusive) to `high` (exclusive); `size` sets the shape of the result, `which can be a matrix`. With a single bound, the values go from 0 up to that bound.
- Function `np.unique(array)`: returns the sorted unique elements of the array.
- Module `np.linalg`: linear algebra functions.
- Function `np.linalg.det`: returns the determinant of the given matrix.
- Function `array.transpose(axes)`, `array.T`, `np.transpose(array)`: transposes a matrix. The optional `axes` parameter gives the new order of the dimensions; without it, the dimensions are reversed.
- Function `array.swapaxes(axis1, axis2)`: exchanges the 2 given axes.
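
The notes on slicing, `np.copy()`, `array.reshape()` and `np.shares_memory()` can be checked with a small experiment. This is only a minimal sketch; the variable names (`base`, `view`, `copy`, `reshaped`) are made up for the illustration.

```python
import numpy as np

base = np.arange(10)

view = base[2:5]          # basic slicing returns a view on the same buffer
copy = base[2:5].copy()   # an explicit copy owns its own memory

view[0] = 99              # writes through to `base`
copy[1] = -1              # leaves `base` untouched

print(base)                              # base[2] is now 99
print(np.shares_memory(base, view))      # True
print(np.shares_memory(base, copy))      # False

# reshape returns a new array object and leaves the shape of the original alone;
# for a contiguous array like this one, the result still shares the data.
reshaped = base.reshape(2, 5)
print(base.shape, reshaped.shape)        # (10,) (2, 5)
print(np.shares_memory(base, reshaped))  # True
```

Because `view` and `reshaped` share memory with `base`, the buffer stays alive as long as any of them exists, which is why copying the small pieces before deleting a big array can be worthwhile.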


In [90]:
import numpy as np

print(np.__version__)
a1 = np.array([0, 1, 2, 3, 4, 5])
print(a1, type(a1), print(id(a1)))                                  # The inner print() runs first (id on its own line) and returns None
a2 = np.array(range(6))
print(a2, type(a2), print(id(a2)))                                  # Different id: identical contents, different object
print(a1 == a2)
a3 = np.copy(a1, order='K')
print(a3, a3[0], type(a3[0]))
a4 = np.array({'one': 1, 'two': 2, 'three': 3})
print(a4, repr(a4))
print(a1.shape, a4.shape)                                           # Returns a tuple with a matrix-like information
print(a1.dtype, repr(a2.dtype), repr(a4.dtype), end = '\n\n')

a5 = np.ones(6, dtype='int32')
a6 = np.ones(6)
print(a5, a6, end = '\n\n')

m1 = np.array([[1, 2, 3], [4, 5, 6]])
print(f'This array has {m1.ndim} dimensions')
print(m1, m1.shape)                                                 # Length is the first dimension
m2 = np.arange(15)
m3 = np.copy(m2)
m3.shape=(3, 5)
print(m2, m3, sep='\n', end = '\n\n')

a7 = np.array([111, 2.3, 'hi'])
a8 = np.array(['111', '2.3', 'hi'])
print(a7.nbytes, a8.nbytes, end = '\n\n')                           # Clear example that saving numbers as strings can waste a ton of space

# Example of how a lot of memory can be wasted within the "data type" selection
print(np.array(['uno', 'dos', 'tres'], dtype='>U4').view('uint8'))
print(np.array(['uno', 'dos', 'tres'], dtype='|S4').view('uint8'), end = '\n\n')

print(np.linspace(0, 10, num=5))
print(np.geomspace(1, 10000, num=5))
print(np.logspace(0, 3, num=4), end = '\n\n')

a9 = np.arange(15).reshape(3, 5)
print(a9, end = '\n\n')

a10 = np.array([10, 20, 30])
a11 = a10[:]
print(np.shares_memory(a10, a11), end = '\n\n')

a12 = np.arange(70).reshape(7, -1)     # Asks for a reshape in 7 rows and whatever columns
print(a12)
print(a12.reshape(5, -1), end = '\n\n')

a13 = np.array([[1, 2, 3], [4, 5, 6]])
print(a13)
np.append(a13, [[7, 8, 9]], axis=0)    # The result is discarded: np.append returns a new array
print(a13)                             # a13 is unchanged

a14 = np.arange(9)
a14.reshape(3, 3)                      # The result is discarded: reshape does not modify the original
print(a14)                             # a14 is still one-dimensional

1.26.4
2224047651312
[0 1 2 3 4 5] <class 'numpy.ndarray'> None
2224060495600
[0 1 2 3 4 5] <class 'numpy.ndarray'> None
[ True  True  True  True  True  True]
[0 1 2 3 4 5] 0 <class 'numpy.int32'>
{'one': 1, 'two': 2, 'three': 3} array({'one': 1, 'two': 2, 'three': 3}, dtype=object)
(6,) ()
int32 dtype('int32') dtype('O')

[1 1 1 1 1 1] [1. 1. 1. 1. 1. 1.]

This array has 2 dimensions
[[1 2 3]
 [4 5 6]] (2, 3)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

384 36

[  0   0   0 117   0   0   0 110   0   0   0 111   0   0   0   0   0   0
   0 100   0   0   0 111   0   0   0 115   0   0   0   0   0   0   0 116
   0   0   0 114   0   0   0 101   0   0   0 115]
[117 110 111   0 100 111 115   0 116 114 101 115]

[ 0.   2.5  5.   7.5 10. ]
[1.e+00 1.e+01 1.e+02 1.e+03 1.e+04]
[   1.   10.  100. 1000.]

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

True

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 2
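
As a complement to the notes on `np.append`, `np.delete` and the size attributes, the sketch below shows how the `axis` argument changes the result. The array `m` and the result names are hypothetical, chosen only for this example.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

with_row = np.append(m, [[7, 8, 9]], axis=0)     # new row: shape (3, 3)
with_col = np.append(m, [[10], [20]], axis=1)    # one extra value per row: shape (2, 4)
flat = np.append(m, [7, 8, 9])                   # no axis: everything is flattened first

no_first_row = np.delete(m, 0, axis=0)           # drops the row at index 0
no_last_col = np.delete(m, -1, axis=1)           # drops the last column

print(m.shape, with_row.shape, with_col.shape, flat.shape)
print(no_first_row, no_last_col, sep="\n")
print(m.size, m.nbytes == m.size * m.itemsize)   # nbytes is size x itemsize
```

In every call `m` itself keeps its original shape, since both functions return new arrays.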

## Unit 2. Optimal Data Exploration Through Pandas

### 2. 1. Series

- Library **Pandas**: tabular data manipulation library (the most widely used).
- Check the **Pandas Documentation**:
  - [[General Functions]](https://pandas.pydata.org/docs/reference/general_functions.html).
  - [[Dataframe Functions]](https://pandas.pydata.org/docs/reference/frame.html).
- `Series`: similar to a Numpy array (`1 datatype` and `1 dimension`), with an added index and an optional `name` (title).
- `DataFrame`: a group of *Series* that share an index, each with its own datatype.
- Parameter `inplace=Bool`: when `True`, the method modifies the original object instead of returning a new one.
- Function `pd.Series(list)`: creates a pandas Series from iterable data, and **also from dictionaries** (where the keys of the dictionary become the index).
- Attribute `ser.index`: returns the index of the Series, in a **compressed form WHENEVER IT CAN REDUCE IT (such as a `RangeIndex` for ranges)**.
- Attribute `ser.values`: returns the values of the Series.
- Attribute `ser.dtype`: returns the datatype of the Series.
- Function `pd.DataFrame(object)`: creates a data frame with the information of the given object.
- Library `from IPython.display import display`: to be able to see several data frames at the same time.
- Function `display(dataframe)`: displays a data frame.
- Attribute `df.columns`: returns the `index` of the columns, similar to `ser.index`.
- Attribute `ser.name`: returns the name of the Series; it can be modified.
- Function `ser.sort_values()`: returns a Series sorted by values.
- Function `ser.sort_index()`: returns a Series sorted by index labels.
- Function `ser.unique()`: returns the unique values of a Series.
- Function `ser.nunique()`: returns the number of unique values.
- Function `ser.value_counts()`: counts how many times each distinct value appears.
- `Pandas operations`: are applied **to the values, element-to-element**, and include all the classical arithmetic operations.
- Functions `ser.sum()`, `ser.mean()`, `ser.median()`: return the `sum`, `mean` and `median` of all Series values.
- Function `ser.apply(lambda function)`: applies a function (for example a lambda) to every value of the Series.
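
The counting and aggregation functions listed above can be tried on a small Series. This is a minimal sketch with made-up data; the names `sex`, `heights` and `in_metres` are hypothetical.

```python
import pandas as pd

sex = pd.Series(["Male", "Female", "Male", "Male", "Female", "Male"], name="sex")

print(sex.unique())          # distinct values, in order of appearance
print(sex.nunique())         # number of distinct values
counts = sex.value_counts()  # occurrences per value, most frequent first
print(counts)

heights = pd.Series([160, 170, 180], name="height")
print(heights.sum(), heights.mean(), heights.median())

# apply runs the function once per value and returns a new Series
in_metres = heights.apply(lambda cm: cm / 100)
print(in_metres)
```

Note that `value_counts()` returns another Series whose index holds the distinct values, so `counts["Male"]` looks up the count of a single value.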


In [180]:
import pandas as pd
import numpy as np
from IPython.display import display

print(pd.Series(['Male', 'Female', 'Male', 'Male', 'Female', 'Female']))
ser = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(ser)
print(ser.index)                                         # Returns a reduced version of the series index (whenever it can)
ser2 = pd.Series(np.arange(30))
print(ser2.index, end = "\n\n")                          # Returns the compressed version of the index

list_data = ['2019-08-10', 3.14, 'ABC', 100, True]
ser3 = pd.Series(list_data)                              
print(ser3, end = "\n\n")

d = {
    'Name': ['Brown', 'Henry', 'Elizabeth'],
    'Age': [22, 35, 58],
    'Sex': ['Male', 'Male', 'Female']
}
df1 = pd.DataFrame(d)
display(df1)
print(df1['Age'])                                        # Access the series of a data frame
print(df1['Age'][0])                                     # Access an element of the series of a data frame
# The index can be set explicitly, but then every column must be a dictionary of {index: value}
d2 = {
    'Name': {0: 'Brown', 1: 'Henry', 2: 'Elizabeth'},
    'Age': {0: 22, 2: 35, 1: 58},
    'Sex': {0: 'Male', 2: 'Male', 1: 'Female'}
}
df2 = pd.DataFrame(d2)
display(df2)

df3 = pd.DataFrame(np.arange(1, 10).reshape(3, 3))
display(df3)

my_data = [220, 13, 23, 34, 234]
ser5 = pd.Series(my_data, index = ['A', 'B', 'A', 'A', 'C'])
ser5.sort_values(inplace = True)
display(pd.DataFrame(ser5))

ser6 = pd.Series(np.arange(0, 50, step = 10), index = ['a', 'b', 'c', 'd', 'e'])
print(ser6)
print(ser6['a':'d'], end = "\n\n")                       # Slicing works fine, but includes the last one

ser_height = pd.Series([160, 170, 180], name = "height")
print(ser_height.apply(lambda x: x + 10), end = "\n\n")

0      Male
1    Female
2      Male
3      Male
4    Female
5    Female
dtype: object
a    1
b    2
c    3
dtype: int64
Index(['a', 'b', 'c'], dtype='object')
RangeIndex(start=0, stop=30, step=1)

0    2019-08-10
1          3.14
2           ABC
3           100
4          True
dtype: object



|   | Name | Age | Sex |
|---|------|-----|-----|
| 0 | Brown | 22 | Male |
| 1 | Henry | 35 | Male |
| 2 | Elizabeth | 58 | Female |


0    22
1    35
2    58
Name: Age, dtype: int64
22


|   | Name | Age | Sex |
|---|------|-----|-----|
| 0 | Brown | 22 | Male |
| 1 | Henry | 58 | Female |
| 2 | Elizabeth | 35 | Male |


|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


|   | 0 |
|---|---|
| B | 13 |
| A | 23 |
| A | 34 |
| A | 220 |
| C | 234 |


a     0
b    10
c    20
d    30
e    40
dtype: int32
a     0
b    10
c    20
d    30
dtype: int32

0    170
1    180
2    190
Name: height, dtype: int64
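
The difference between sorting by values and by index, and between label slicing and positional slicing, can be sketched as follows. The data and the names `by_value` and `by_label` are assumptions made for this example.

```python
import pandas as pd

ser = pd.Series([220, 13, 23, 34, 234], index=["e", "b", "a", "d", "c"])

by_value = ser.sort_values()     # returns a new Series; `ser` keeps its order
by_label = ser.sort_index()      # sorted by the index labels instead

label_slice = by_label["a":"c"]       # label slice: the end label "c" is included
position_slice = by_label.iloc[0:2]   # positional slice: position 2 is excluded

print(by_value, by_label, sep="\n\n")
print(label_slice, position_slice, sep="\n\n")
```

Without `inplace = True`, neither sort touches `ser`, so the result must be assigned to a name to be kept.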



### 2. 2. DataFrames

- Function `pd.read_csv('name.csv')`: reads the data from a given `csv` file.
- Function `df.info()`: returns a brief summary of the given data frame.
- Function `df.head(n)`, `df.tail(n)`: returns the first or last `n` rows of a data frame.
- Function `df[[col1_name, coln_name]]`: accesses the values of the given columns.
- Function `df.columns = [list]`, `df.index = [list]`: the new given list will be the new index or column names.
- Function `df.rename(columns/index = {dict}, inplace = bool)`: renames the column or index labels that appear as keys of the dictionary to the corresponding values, returning a new data frame or modifying the original.
- Function `df.drop([index/col], axis = dim, inplace = bool)`: drops the chosen columns or rows.
- Function `df.loc[index]`: returns the selected row `or rows ([[]])` by the given index labels. **GOOD TO REORDER ROWS OR COLUMNS** (pass the labels in the desired order, e.g. `df.loc[:, [col2, col1]]`).
- Function `df.iloc[[num1, num2]]`: returns the selected row `or rows ([[]])` by the given integer positions. **GOOD TO REORDER ROWS OR COLUMNS** (e.g. `df.iloc[:, [1, 0]]`).
- Function `df.loc[dim1, dim2]`, `df.iloc[dim1, dim2]`: returns the data located at that point (`dim1` selects rows, `dim2` selects columns). Example: `df.iloc[0, [2, 3]]`.
- `df[col]`: returns a pandas Series.
- `df[[col]]`: returns a pandas DataFrame.
- Statement `df[new] = val`: column addition. Can be done using one value for all rows or a list with one value per row (the example uses one value for all). **This action is INPLACE**. **Column addition has a LOW COST**.
- Statement `df.loc[out_of_range] = [val]`: row addition, using a label that is not yet in the index. Can be done using one value for all columns or a list with one value per column (the example applies a list). **This action is INPLACE**. **Row addition has a HIGHER COST, AS IT CREATES A NEW DATAFRAME**.
- Function `df.reset_index(inplace = bool)`: moves the current index into column 0 and replaces it with a default numeric index. Used to **manipulate index names**.
- Function `df.set_index([col])`: replaces the current index with the given column **or columns (if a list is given)**. **Beware of losing the current index**.
- Function `df.transpose()`, `df.T`: transposes the given data frame.
- Function `df.reindex(new_index, fill_value = n)`: returns a new data frame that follows the new index; labels that did not exist before are filled with `NaN` or the given value.
- Function `df.sort_index(ascending = bool, inplace = bool)`: reorders the data frame by its index.
- Pandas and DataFrame `operations`: typical element-to-element operations, aligned by index. **Operating with NaN or missing values will return a NaN**.
- Function `ser1.op(ser2, fill_value = n)`: operations like `add`, `sub`, `mul`, `div` can be performed in this way so that missing values are replaced by `n` before operating.
- Function `pd.merge(df1, df2, on = common_col_name, how = y)`: **combines data frames on the values they have in common** in `common_col_name` (`left_on` and `right_on` can be used when the key columns have different names in each frame) according to the `y` option (`inner`: intersection of keys, `left`: keeps all left rows and fills the gaps with `NaN`, `right`: keeps all right rows and fills the gaps, `outer`: union of keys, fills everything missing). Without `on`, it looks for columns with the same name in both frames and merges on them. If there is no common column, an error is raised.
- Function `pd.concat([df1, df2], sort = bool, axis = dim)`: puts the data frames **one after the other**: stacked vertically with `axis = 0`, side by side with `axis = 1`.
- Function `df.sort_values(by=['chosen_col1', 'chosen_col2'])`: sorts the rows by the chosen column or columns (later columns break ties).
- Function `pd.MultiIndex.from_tuples(list_of_tuples)`: using a `zip()` function we make a list of combinations of indexes to create a multi-index.
- Function `pd.pivot_table(df, index = [], columns = [], values = 'chosen_val', fill_value = n, aggfunc = lambda/sum/...)`: builds a summary table (a **multi-index DataFrame** when several keys are given). `index` and `columns` list the grouping keys from outer to inner, `values` is the column to aggregate, and `aggfunc` is how it is aggregated.
- Function `df.groupby(['col1', 'col2'])['value_col'].mean()`: returns a Series with the mean of `value_col` for each combination of the grouping columns, indexed by those columns.
- Function `df.sum(axis = dim)`: sums all values by the given dimension.
- Function `df.mean(axis = dim, skipna)`: gives the mean from all numbers. `skipna` skips **Not a Number (`NaN`)** values.
- Function `df.describe()`: calculates several statistical values from a data frame (count, mean, std, min, max, and percentiles).
- Function `df.count(axis)`: counts the non-`NaN` elements along the given axis.
- Function `df.corr()`: returns a matrix with the Pearson correlation between every pair of columns (`-1.0` to `1.0` in an `N x N` matrix).
- Function `df.corrwith(other)`: returns a Series with the `Pearson` correlation of each column with the given Series or DataFrame.
- Function `df.ser1.corr(df.ser2)`: returns the `Pearson` correlation between two Series of the data frame as a single number.


In [16]:
import numpy as np
import pandas as pd
from IPython.display import display

my_header = ['a', 'b', 'c']
my_index_out = ['G1'] * 3 + ['G2'] * 3                          # Outer index list
print(my_index_out)
my_index_in = [1, 2, 3] * 2                                     # Inner index list
print(my_index_in)
my_index_zipped = list(zip(my_index_out, my_index_in))          # Creates a list of tuples
my_index = pd.MultiIndex.from_tuples(my_index_zipped)           # Creates a Multi-Index object
df = pd.DataFrame(data = np.random.randn(6, 3), index = my_index, columns = my_header)
display(df)
print(df.loc['G1', 1])                                          # Accesses row with index G1, 1
print(df.loc['G1', 1]['a'])                                     # Accesses row and column, therefore to value

['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
[1, 2, 3, 1, 2, 3]


|    |   | a | b | c |
|----|---|---|---|---|
| G1 | 1 | -0.333985 | -0.339295 | -0.463194 |
| G1 | 2 | -0.941581 | -0.282294 | 0.817271 |
| G1 | 3 | -0.251859 | -0.255707 | -0.048179 |
| G2 | 1 | -0.568643 | -0.248867 | 0.168505 |
| G2 | 2 | -0.550231 | -0.996165 | 1.5169 |
| G2 | 3 | 0.952071 | 1.447052 | -1.943138 |


a   -0.333985
b   -0.339295
c   -0.463194
Name: (G1, 1), dtype: float64
-0.333985241316585
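
A multi-index also appears naturally when grouping by more than one column. The sketch below, on an invented `sales` table, compares `groupby` with `pd.pivot_table`; all column and variable names are assumptions made for the illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S"],
    "product": ["A", "B", "A", "A", "B"],
    "units": [10, 20, 30, 50, 40],
})

# Mean of `units` per (region, product) pair; the result is a Series with a MultiIndex
mean_units = sales.groupby(["region", "product"])["units"].mean()
print(mean_units)
print(mean_units.loc[("S", "A")])     # one value, addressed by the full index tuple

# Same grouping keys, but one key becomes the rows and the other the columns
table = pd.pivot_table(sales, index="region", columns="product",
                       values="units", aggfunc="sum", fill_value=0)
print(table)
```

The two results hold the same kind of information: `groupby` keeps both keys in the row index, while `pivot_table` spreads one of them across the columns.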


## Unit 3. Pandas Data Preprocessing for Optimal Model Execution

- Check the **Pandas Documentation**:
  - [[General Functions]](https://pandas.pydata.org/docs/reference/general_functions.html).
  - [[Dataframe Functions]](https://pandas.pydata.org/docs/reference/frame.html).
- Library `seaborn`: used to create statistical graphs and informative plots.
- Function `df.col.value_counts(dropna=bool)`: returns a Series with the counted values of a given column, with or without `NaN`.
- Function `df.subset.notnull()`: returns a boolean mask that is `True` for the non-null values of the selected subset of the data frame.
- Function `df.subset.isnull()`: returns a boolean mask that is `True` for the null values of the selected subset of the data frame.
- Function `df.isnull().sum(axis)`: returns a Series with **HOW MANY NaN VALUES THERE ARE** in each column (`axis=0`) or row (`axis=1`), which shows where they are concentrated.
- Function `df.dropna(axis, how, thresh, inplace...)`:
  - `axis`: `0` for rows, `1` for columns.
  - `how`: `any` (drop if at least one `NaN`), `all` (drop if all are `NaN`).
  - `thresh`: gives a `minimum` of non-NaN values required to keep the row/column.
- Function `df.fillna(value)`: fills the `NaN` values with the given value.
- Function `df.ffill()`: fills each `NaN` with the last valid value that precedes it (forward fill).
- Function `df['col'].value_counts(dropna=True).idxmax()`: returns the most frequent value; can be used to fill missing values.
- Function `df.duplicated()`: returns a Series of booleans where, by default, `True` marks rows that are exact duplicates of a previous one (i.e. keeps the first and flags the rest).
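
The missing-value functions above can be combined into a small cleaning routine. This is only a sketch on invented toy data (it does not use the `auto_mpg.csv` file), and filling with the mean and the most frequent value is just one possible strategy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "horsepower": [130.0, np.nan, 150.0, 150.0, np.nan],
    "origin": ["USA", "USA", None, "Japan", "USA"],
})

print(df.isnull().sum())                 # NaN count per column

complete_rows = df.dropna(how="any")     # keeps only the rows without any NaN
at_least_one = df.dropna(thresh=1)       # keeps rows with at least 1 non-NaN value

# Most frequent value of a column, used here to fill its gaps
most_common = df["origin"].value_counts(dropna=True).idxmax()
filled = df.fillna({"horsepower": df["horsepower"].mean(), "origin": most_common})

print(filled)
print(filled.duplicated())               # True marks a repeat of an earlier row
```

After filling, two rows end up with identical values, which is why `duplicated()` flags the later one.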

In [48]:
import pandas as pd
import numpy as np
from IPython.display import display

df = pd.read_csv('./CSV/auto_mpg.csv')
display(df.head(3))
print(df.info())
df['horsepower'] = df['horsepower'].replace('?', np.nan)
print('A =', df['horsepower'])                         # Returns a Series of the given columns, original order and duplicates, with index
print('B =', df['horsepower'].unique())                # Same but no duplicates, order of appearance, no index

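|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|-----|-----------|--------------|------------|--------|--------------|------|--------|------|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 1970 | USA | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 1970 | USA | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 1970 | USA | plymouth satellite |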


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    int64  
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    object 
 8   name          392 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 27.7+ KB
None
A = 0      130
1      165
2      150
3      150
4      140
      ... 
387     86
388     52
389     84
390     79
391     82
Name: horsepower, Length: 392, dtype: int64
B = [130 165 150 140 198 220 215 225 190 170 160  95  97  85  88  46  87  90
 113 200 210 193 100 105 175 153 180 110  72  86  70  76  65  69  60  80
  54 208 155 112  92 145 137 158 167

## Unit 4. Data Visualization for Various Data Scales

# Chapter 2. 