## Pandas

- Comprehensive Python library for data manipulation and analysis, in particular for tables and time series
- Pandas **data frames** = tables
- Supports interaction with SQL, CSV, JSON, ...
- Integrates with Jupyter, numpy, matplotlib, ...
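As a rough illustration of the I/O support listed above, the sketch below sends a tiny table through CSV, JSON, and an in-memory SQLite database. The table contents, the column names, and the table name `people` are made up for the example.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical toy table; names and values are invented for illustration.
df = pd.DataFrame({"name": ["Ada", "Bo"], "age": [22, 20]})

# CSV: write to text and read it back (a file path works the same way).
from_csv = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# JSON: one object per row.
from_json = pd.read_json(io.StringIO(df.to_json(orient="records")))

# SQL: store the table in an in-memory SQLite database and query it.
conn = sqlite3.connect(":memory:")
df.to_sql("people", conn, index=False)
from_sql = pd.read_sql("SELECT name, age FROM people WHERE age > 20", conn)
conn.close()

print(from_csv)
print(from_json)
print(from_sql)
```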


### Reading CSV

In [1]:
import pandas as pd
students = pd.read_csv('students.csv')
students

Unnamed: 0,name,nationality,city,latitude,longitude,gender,age,english.grade,math.grade,sciences.grade,language.grade,portfolio.rating,coverletter.rating,refletter.rating
0,Kiana Lor,China,Suzhou,31.31,120.62,F,22,3.5,3.7,3.1,1.0,4,4.0,4
1,Joshua Lonaker,United States of America,Santa Clarita,34.39,-118.54,M,22,2.9,3.2,3.6,5.0,5,4.0,5
2,Dakota Blanco,United States of America,Oakland,37.80,-122.27,F,22,3.9,3.8,3.2,5.0,3,3.0,4
3,Natasha Yarusso,United States of America,Castro Valley,37.69,-122.09,F,20,3.3,2.8,3.2,5.0,5,2.0,4
4,Brooke Cazares,Brazil,São José dos Campos,-23.18,-45.88,F,21,3.7,2.6,3.4,1.0,4,4.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
302,Austin Haas,United States of America,Columbus,39.96,-83.00,M,20,3.6,3.7,3.1,5.0,4,5.0,5
303,Madison Fithian,United States of America,Los Angeles,34.05,-118.24,F,20,3.6,3.9,4.0,5.0,5,5.0,3
304,Zachary Mulvahill,United States of America,Los Angeles,34.05,-118.24,M,20,3.2,3.4,3.9,5.0,5,5.0,3
305,Eliana Michelsen,United States of America,Oakland,37.80,-122.27,F,23,3.0,2.8,2.9,5.0,4,4.0,5


### Indexing data frames

In [None]:
students['name']  # single column

In [None]:
students.name

In [None]:
students[['name', 'nationality']]    # select multiple columns

In [None]:
students.head(2)  # only display two top entries

In [None]:
students.tail(1)    # only display last entry

In [None]:
students[1:3]   # row slicing by position: rows at positions 1 and 2 (end excluded)

In [None]:
students[::2]   # every second row

In [None]:
students.at[2, 'nationality'] # use at to lookup single cell

In [None]:
students.loc[1] # single row is accessed using .loc[row label]

In [None]:
students.loc[1]['name']  # another way to get a single value (students.loc[1, 'name'] avoids the chained lookup)

In [None]:
students.loc[[1, 3], ['nationality', 'name']]  # extract sub data frame

In [None]:
students['nationality'].unique()   # check unique entries of a column

In [None]:
students[students.nationality == 'Mexico'] # create a mask, similar to WHERE in SQL

In [None]:
# Combine conditions element-wise with & (and), | (or), ~ (not); wrap each condition in parentheses
students[(students.nationality == 'Mexico') & (students.age > 22)]
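With the default index (0, 1, 2, ...), row labels and row positions coincide, which hides the difference between label-based and position-based access. The following sketch uses a made-up table whose labels deliberately differ from the positions to make the distinction visible.

```python
import pandas as pd

# Hypothetical toy table; the index labels 10, 20, 30 are chosen on purpose
# so that labels and positions do not coincide.
df = pd.DataFrame(
    {"name": ["Ada", "Bo", "Cy"], "age": [22, 20, 23]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "name"]   # row with label 20
by_position = df.iloc[1, 0]     # second row, first column
plain_slice = df[1:3]           # integer slices with [] are positional
label_slice = df.loc[10:20]     # label slices include both end points

print(by_label, by_position)
print(plain_slice)
print(label_slice)
```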

### Creating data frames from data

In [None]:
pd.DataFrame([10, 11, 12, 13]) # one dimensional data

In [None]:
# data frame using dictionary
pd.DataFrame({'A': [1, 2, 3], 'B': ['one', 'two', 'three']})  


In [None]:
pd.DataFrame([[10, 11], [12, 13]])  # two dimensional list

This looks like a matrix! Both data frames and matrices are 2-dimensional data structures. The difference is that each column of a data frame can have its own data type (numeric, string, categorical, etc.), while a matrix stores **one** data type for all of its entries.
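A small sketch of that difference, using the dictionary example from above: the data frame keeps one dtype per column, whereas a NumPy array built from mixed values falls back to a single common dtype.

```python
import numpy as np
import pandas as pd

# Data frame: each column keeps its own dtype.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["one", "two", "three"]})
print(df.dtypes)

# NumPy array: mixed input is coerced to one common dtype (here: strings).
arr = np.array([[1, "one"], [2, "two"]])
print(arr.dtype)
print(arr[0, 0])  # the integer 1 has become the string '1'

# Converting the mixed data frame to an array yields a generic object array.
as_array = df.to_numpy()
print(as_array.dtype)
```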

### Data frame in Python

In pandas, a data frame is a two-dimensional, mutable, tabular data structure whose columns may hold different data types. Both axes are labeled: rows by the index and columns by column names. Data frames are useful in data preprocessing because they come with a rich set of data handling methods.

### Applications of data frames
![image.png](attachment:image.png)

### Matrix in Python
A matrix is a homogeneous collection of values arranged in a two-dimensional rectangular grid: an m×n array in which every entry has the same data type. In Python it is usually represented as a two-dimensional NumPy array, created for example from nested lists or from a random number generator. The number of rows and columns is fixed when the array is created. NumPy supports element-wise arithmetic operations such as addition, subtraction, multiplication, and division on these arrays.

In [None]:
import numpy as np
A = np.random.random((3, 4))
A

Why is `random` used twice?
- `np.random`: The first `random` is the module within NumPy that contains the functions for random number generation.
- `random((3, 4))`: The second `random` is the function that generates the random numbers. It is called with the argument `(3, 4)`, which specifies the shape of the output array (3 rows, 4 columns).

In [None]:
pd.DataFrame(A)

In [None]:
R = pd.DataFrame(A, columns=list('ABCD'), index=list('xyz'))
R
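Once rows and columns carry labels, values can be looked up by name and the data frame can be turned back into a plain array. The sketch below mirrors the cell above but uses fixed values (an assumption made only so the result is reproducible) instead of random ones.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so the lookups below are predictable.
A = np.arange(12).reshape(3, 4)
R = pd.DataFrame(A, columns=list("ABCD"), index=list("xyz"))

cell = R.loc["y", "B"]   # value in row 'y', column 'B'
column = R["C"]          # a whole column, indexed by the row labels
back = R.to_numpy()      # back to a plain NumPy array (labels are dropped)

print(cell)
print(column)
print(back.shape)
```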

### Matrix vs Data Frame
![image.png](attachment:image.png)

### Data Frames summary

A **DataFrame** in Pandas is a flexible data structure designed for data manipulation and analysis in Python. It organizes data in a 2D tabular format, where rows represent individual records, and columns represent variables or features. 

**Similarities to SQL Tables:**
- **Tabular Structure**: Both DataFrames and SQL tables organize data in a similar row-and-column format.
- **Schema**: Just like SQL tables, DataFrames have a defined schema, with each column having a specific data type.
- **Operations**: You can perform SQL-like operations on DataFrames, such as filtering, grouping, joining, and aggregating data.

**Differences:**
- **In-Memory vs. Persistent Storage**: DataFrames are in-memory structures, meaning they exist in the computer's memory and are lost when the session ends. SQL tables, however, are stored persistently in a database.
- **Flexibility**: DataFrames offer more flexibility for data manipulation with Python, allowing operations that would be more complex in SQL, like applying custom functions to columns or rows.
- **Scalability**: SQL databases are designed for handling large datasets and concurrent users, whereas DataFrames are better suited for in-memory analysis of smaller to moderately large datasets.
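The SQL-like operations and the custom-function flexibility mentioned above can be sketched as follows. Both tables and all their values are invented for the example.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration.
students = pd.DataFrame({
    "name": ["Ada", "Bo", "Cy"],
    "nationality": ["Mexico", "Mexico", "China"],
    "age": [23, 21, 20],
})
countries = pd.DataFrame({
    "nationality": ["Mexico", "China"],
    "continent": ["North America", "Asia"],
})

# SELECT * FROM students WHERE age > 20
older = students[students["age"] > 20]

# SELECT nationality, AVG(age) FROM students GROUP BY nationality
mean_age = students.groupby("nationality")["age"].mean()

# SELECT * FROM students LEFT JOIN countries USING (nationality)
joined = students.merge(countries, on="nationality", how="left")

# Something that is awkward in SQL: apply an arbitrary Python function.
initials = students["name"].apply(lambda s: s[0].lower())

print(older)
print(mean_age)
print(joined)
print(initials)
```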

In essence, DataFrames combine the familiarity of SQL tables with the flexibility of Python, which makes them a natural starting point for machine learning workflows.

# Example application

Here we will go through the steps of a simple machine learning application based on the Titanic data set.

## Load the data and print a bit of information

In [None]:
import pandas as pd
import numpy as np
titanic = pd.read_csv("titanic.csv")

In [None]:
titanic

Before we can start, we need to handle a few issues. There are several `NaN` values in the data set, which our machine learning algorithms cannot handle. They cannot handle non-numeric values either, which we deal with in two different ways: we turn the `Sex` feature into a dummy variable and remove all other non-numeric columns.

In [None]:
# Drop the columns we will not use:
titanic = titanic.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis='columns')

# Encode sex as a dummy variable (1 = male, 0 = female):
titanic['Sex'] = pd.get_dummies(titanic['Sex'])['male'].astype(int)

# Also drop all rows that contain NaN values:
titanic = titanic.dropna()
titanic
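To make the two preprocessing steps easier to follow, here is a minimal sketch on a three-row toy table (invented values, with column names borrowed from the Titanic data). It counts the missing values first, which is a useful check before dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the Titanic set.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 30.0],
})

# How many missing values does each column have?
missing = toy.isna().sum()
print(missing)

# get_dummies creates one indicator column per category;
# keeping only 'male' gives a single 0/1 feature.
toy["Sex"] = pd.get_dummies(toy["Sex"])["male"].astype(int)

# Remove every row that still contains a NaN.
clean = toy.dropna()
print(clean)
```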

We want to predict whether a given passenger survived or not, so the `Survived` column is our target variable (i.e. our labels).

In [None]:
labels = titanic["Survived"]
labels

In [None]:
data = titanic.drop(['Survived'], axis='columns')
data

## Building a model

For illustrating the process we use the naïve Bayes model (more on this later). In scikit-learn all ML algorithms are implemented in their own class (for naïve Bayes it is `GaussianNB` under `sklearn.naive_bayes`) that should be instantiated. 

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# It is extremely easy to try with a different machine learning algorithm instead! 
# Just uncomment the two lines below to use a "k Nearest Neighbors"-model instead of Naive Bayes:

# from sklearn.neighbors import KNeighborsClassifier 
# model = KNeighborsClassifier()

Use the `fit` method to learn the model from data. It takes the training data and the corresponding labels as arguments.

In [None]:
model.fit(data, labels)

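# fit() estimates the model's parameters from the data.
# For a linear model these would be the coefficients in f(x) = a_1 * x_1 + a_2 * x_2 + ...;
# for GaussianNB they are the per-class mean and variance of each feature.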


## Making predictions

We use the learned model to make predictions about new data instances for which we do not know the labels.
Let's see if a person with PassengerId 100, class 3, female, age 20, no siblings, parch 0 and a fare of 30 is predicted to survive:

In [None]:
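# New data organized in a two-dimensional array, in the column order of `data`
# (assumed here: PassengerId, Pclass, Sex (1 = male, 0 = female), Age, SibSp, Parch, Fare)
x_new = np.array([[100, 3, 0, 20, 0, 0, 30]])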

# Create a dataframe
feature_names = data.columns.tolist()
x_new_df = pd.DataFrame(x_new, columns=feature_names)

# Make the prediction
predict = model.predict(x_new_df)
print("Prediction: {}".format(predict))

Lastly, let's see for what fraction of the passengers the model correctly predicts whether they survived or not. Note that this score is computed on the same data the model was trained on:

In [None]:
print("Accuracy score: {}".format(model.score(data, labels)))
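A score computed on the rows the model was fitted on tends to be optimistic. One common alternative, swapped in here purely as an illustration, is k-fold cross-validation with `cross_val_score`. The sketch uses synthetic data from `make_classification` rather than the Titanic file, so the numbers it prints say nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): 200 samples with 7 numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold, 5 times.
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy: {:.3f}".format(scores.mean()))
```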

## Tuning hyperparameters

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Rename the data
X = data
y = labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Try to remove random_state

# Define the range of hyperparameters to test
neighbors_range = range(3, 20)  # Number of neighbors from 3 to 19
distance_metrics = ['euclidean', 'manhattan', 'minkowski']  # Illustration only: with its default p=2, 'minkowski' is the same as 'euclidean'

# Variables to store the best parameters and highest accuracy
best_accuracy = 0
best_params = {'n_neighbors': None, 'metric': None}

# Nested loop to iterate over the hyperparameters
for n_neighbors in neighbors_range:
    for metric in distance_metrics:
        # Initialize the KNN model with current hyperparameters
        knn = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)

        # Train the model
        knn.fit(X_train, y_train)

        # Make predictions on the test set
        predictions = knn.predict(X_test)

        # Calculate the accuracy
        accuracy = accuracy_score(y_test, predictions)

        # Update the best parameters if current accuracy is higher
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params['n_neighbors'] = n_neighbors
            best_params['metric'] = metric

# Print the best set of parameters and the highest accuracy achieved
print(f"Best parameters: Number of Neighbors - {best_params['n_neighbors']}, Distance Metric - {best_params['metric']}")
print(f"Highest Accuracy: {best_accuracy*100:.2f}%")
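The nested loop above can also be expressed with scikit-learn's `GridSearchCV`, which tries every parameter combination and picks the best one by cross-validation on the training set, leaving the test set untouched for the final check. This is a sketch of that alternative, not the approach used in the cell above, and it again relies on synthetic data from `make_classification` as a stand-in for the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 300 samples with 7 numeric features.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same search space as the loop, minus the redundant 'minkowski' entry.
param_grid = {
    "n_neighbors": list(range(3, 20)),
    "metric": ["euclidean", "manhattan"],
}

# 5-fold cross-validation on the training set for every combination.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# Accuracy of the best model on data it has never seen.
test_accuracy = search.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy: {:.3f}".format(search.best_score_))
print("Test accuracy: {:.3f}".format(test_accuracy))
```

Selecting hyperparameters by cross-validation avoids choosing them on the same test set that is later used to report accuracy.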
