# Workshop: Introduction to Python libraries for Data Science

This workshop covers the basics of the NumPy, Pandas and Matplotlib libraries. NumPy is a library for working with arrays of data, Pandas is a library for data manipulation and analysis, and Matplotlib is a library for data visualization. These libraries are commonly used in data science and machine learning projects.

All three are large libraries, and we will only cover a small portion of their functionality. We focus on the most important parts, so that you know how to combine them with each other and with other libraries such as Scikit-learn, which we will cover in future workshops.

In [2]:
import numpy as np

#### Exercise 1: Convert a list into a numpy array

The objective of this exercise is to convert a list into a numpy array. The list is given below:

In [3]:
# Named `values` rather than `list` so the built-in list type is not shadowed
values = [1, 2, 3, 4, 5]

#### Your Code Here ####



# values = 

#### End Code ####

# The type needs to be: numpy.ndarray
type(values)


values
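
As a hint, here is a minimal sketch of the conversion on a different list. The `temperatures` data is made up for illustration and is not part of the exercise.

```python
import numpy as np

# Hypothetical sample data, not the exercise's own list
temperatures = [21.5, 19.0, 23.25]

# np.array copies the list's contents into a new ndarray
temps_array = np.array(temperatures)

print(type(temps_array))   # <class 'numpy.ndarray'>
print(temps_array.dtype)   # float64
print(temps_array.shape)   # (3,)
```

The element type (`dtype`) is inferred from the list contents, so a list of Python floats becomes a `float64` array.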

#### Exercise 2: How to create a numpy array

Here are six common ways to create a NumPy array:
- An array of zeros
- An array of ones
- An empty array
- An array of the first n integers
- An array of random values between -1 and 1
- An array of random values between x and y
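
The six constructions above can be sketched as follows. This is one possible set of calls, not the only one. The names `n`, `low` and `high` are placeholders chosen for this example, and the seeded generator is there only to make the random values reproducible.

```python
import numpy as np

n = 5                    # hypothetical array length
low, high = 2.0, 10.0    # hypothetical bounds standing in for x and y

zeros = np.zeros(n)      # every element is 0.0
ones = np.ones(n)        # every element is 1.0
empty = np.empty(n)      # uninitialized memory: the contents are arbitrary
first_n = np.arange(n)   # 0, 1, ..., n-1

rng = np.random.default_rng(seed=0)        # seeded random generator
unit = rng.uniform(-1, 1, size=n)          # uniform samples in [-1, 1)
ranged = rng.uniform(low, high, size=n)    # uniform samples in [low, high)
```

Note that `np.arange(n)` starts at 0. If you want the integers 1 to n instead, `np.arange(1, n + 1)` does that.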

In [4]:
# An array of zeros

# An array of ones

# An empty array

# An array of the first n integers

# An array of random values between -1 and 1

# An array of random values between x and y

#### Exercise 3: The slicing of a numpy array

The objective of this exercise is to understand how to slice a numpy array.
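
Before trying it on random data, here is a small sketch of the main indexing forms on a fixed array. The `data` values are made up so that each result is easy to check by eye.

```python
import numpy as np

# Deterministic toy data
data = np.array([5, -2, 7, -9, 0, 3, -4, 8])

first_three = data[:3]        # [ 5 -2  7]
last_two = data[-2:]          # [-4  8]
middle = data[2:5]            # indices 2, 3, 4 (the stop index is excluded): [ 7 -9  0]
negatives = data[data < 0]    # boolean mask: [-2 -9 -4]
picked = data[[0, 3, 6]]      # list of indices: [ 5 -9 -4]

clipped = data.copy()         # copy first so `data` stays intact
clipped[clipped < 0] = 0      # masked assignment: [5 0 7 0 0 3 0 8]
```

A slice such as `data[:3]` is a view on the original array, while mask and index-list selections return copies. This matters when you modify the result.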

In [5]:
array = np.random.randn(10)

# Get the first three elements

# Get the last two elements

# Get the elements from index 4 to 7

# Get all the elements with negative values

# Set all the elements with negative values to 0

# Get the elements 3, 4 and 8

#### Exercise 4: Operators on numpy arrays

The objective of this exercise is to understand how to use operators on numpy arrays.
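
As a quick illustration, arithmetic and comparison operators apply element by element and return a new array. The `data` values below are made up for this sketch.

```python
import numpy as np

data = np.array([-1.5, 0.0, 2.0, 4.5])

shifted = data + 10     # [ 8.5 10.  12.  14.5]
scaled = data * 2       # [-3.  0.  4.  9.]
positive = data > 0     # [False False  True  True]
```

The original `data` array is left unchanged by these operations. Use `data += 10` if you want to modify it in place.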

In [6]:
array = np.random.randn(10)
print(array)

# Add 10 to all the elements

# Multiply all the elements by 2

# Get a boolean array with True for all the elements that are greater than 0


[-0.68920912  2.31272758 -0.050211    0.22517355 -0.78696911 -0.70373212
 -1.97459493  0.19567161  0.98507939 -1.1752195 ]


#### Exercise 5: Data Exploration

The objective of this exercise is to explore the data in an array using summary statistics and reshaping.
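
Here is a minimal sketch of the relevant calls on a small fixed array, so that the results can be verified by hand. The `data` values are invented for the example.

```python
import numpy as np

data = np.array([3, 0, 8, 1, 0, 6])

mean = data.mean()                  # 3.0
total = data.sum()                  # 18
largest = data.max()                # 8
grid = data.reshape(2, 3)           # [[3 0 8], [1 0 6]]
row_max = grid.max(axis=1)          # one maximum per row: [8 6]
nonzero = np.count_nonzero(data)    # 4
```

`axis=1` reduces along the columns, giving one result per row. `axis=0` would give one result per column instead.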

In [7]:
array = np.random.randn(10)

# Get the mean of the array

# Get the sum of the array

# Get the maximum value of the array

# Reshape the array to a 2x5 array

# Get the max value of each row

# Count the nonzero values

# Combining Pandas and Numpy

Pandas is built on top of NumPy. It is used to manipulate tabular data, mainly through its dataframe object.

#### Exercise 6: Load a dataset using Pandas

In [8]:
import pandas as pd

# Read this CSV using pandas: https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv
# Set the column names to: "Length", "Diameter", "Height", "Whole weight", "Shucked weight", "Viscera weight", "Shell weight", "Age"

# dataframe = 

# Print the first rows of the dataframe

# Print the last rows of the dataframe
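
As a hint, the sketch below loads a headerless CSV and names its columns. To stay runnable offline it reads from an in-memory string instead of a URL. The data and the column names `Size` and `Count` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical headerless CSV text standing in for a remote file
raw_csv = io.StringIO(
    "1.2,10\n"
    "0.8,7\n"
    "1.5,12\n"
    "0.9,8\n"
)

# header=None says the first line is data; names supplies the column labels
df = pd.read_csv(raw_csv, header=None, names=["Size", "Count"])

print(df.head(2))   # first two rows
print(df.tail(2))   # last two rows
```

`pd.read_csv` also accepts a URL string in place of the file object. Called without an argument, `head()` and `tail()` return five rows.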


#### Exercise 7: Data Exploration

The objective of this exercise is to extract information from the dataset.
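
Here is a short sketch of the usual inspection calls on a small hand-made dataframe. The column names and values are assumptions for the example and are not taken from the abalone dataset.

```python
import pandas as pd

df = pd.DataFrame({"Size": [1.2, 0.8, 1.5, 0.9], "Count": [10, 7, 12, 8]})

col_max = df["Size"].max()     # 1.5
col_mean = df["Size"].mean()   # approximately 1.1
col_min = df["Size"].min()     # 0.8

# Several statistics at once, returned as a Series indexed by statistic name
summary = df["Count"].agg(["max", "mean", "min"])

n_rows, n_cols = df.shape      # 4 rows, 2 columns
print(df.dtypes)               # one dtype per column
```

`df.info()` and `df.describe()` print similar overviews in a single call.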

In [9]:
# Print the maximum value of the "Height" column

# Print the mean value of the "Height" column

# Print the minimum value of the "Height" column

# Print all the previously mentioned values for the "Length" column on a single line

# Print the number of rows, columns and type of each column


#### Exercise 8: Transforming the Dataframe

The objective of this exercise is to derive a new dataframe from the first one by adding, removing and sorting columns.
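
The same kinds of transformation can be sketched on a toy dataframe. The columns `Width`, `Height` and `Label` and the threshold of 4 are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Width": [2.0, 3.0, 1.0],
    "Height": [4.0, 1.0, 5.0],
    "Label": ["a", "b", "c"],
})

df["Area"] = df["Width"] * df["Height"]         # element-wise product: 8.0, 3.0, 5.0
df["Large"] = np.where(df["Area"] > 4, 1, 0)    # 1 where the condition holds, else 0
df = df.drop(columns=["Label"])                 # drop returns a new dataframe
df = df.sort_values("Area")                     # ascending order by default
```

Most dataframe methods return a new object rather than modifying the original, which is why the result is assigned back to `df`.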

In [10]:
# Create a new column called "Volume" with the formula: Length * Diameter * Height * pi / 4

# Create a new column called "Young" with the value 1 if the age is less than 9 and 0 otherwise

# Remove the column "Shucked weight"

# Sort the dataframe by the "Age" column

#### Exercise 9: Data Manipulation

The objective of this exercise is to manipulate the data in order to extract information from it.
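
As a hint, here is a minimal sketch of a manual mean and a conditional replacement on a toy column. The `Score` column and the threshold of 10 are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"Score": [4, 12, 7, 15]})

# Mean computed by hand; it matches df["Score"].mean()
manual_mean = df["Score"].sum() / len(df)    # 38 / 4 = 9.5

# .loc with a boolean mask selects the rows to overwrite
df.loc[df["Score"] < 10, "Score"] = 0        # Score becomes 0, 12, 0, 15
```

Assigning through `.loc` modifies the dataframe in place, so compute any statistics you need on the original values first.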

In [11]:
# Sum all the values of the "Age" column and divide it by the number of rows

# Replace all the values of the "Age" column that are less than 10 by 0

# Using Matplotlib for Data Visualization

Matplotlib is a library for visualizing data, mainly by plotting graphs. It works well with data stored in a dataframe.

#### Exercise 10: Plotting a histogram

The objective of this exercise is to plot a graph using Matplotlib, and to understand how parameters can change the graph.
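
Here is a sketch of a histogram with bins, color, axis labels and a title, drawn from random samples rather than the dataset. The data, labels and title text are assumptions for the example. It uses the `fig, ax` interface, and the `plt.hist`, `plt.xlabel`, `plt.ylabel` and `plt.title` functions are the pyplot equivalents.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=10, scale=2, size=500)   # hypothetical data

fig, ax = plt.subplots()

# hist returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, _ = ax.hist(samples, bins=20, color="red")

ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of 500 random samples")
```

With `bins=20` the value range is split into 20 equal-width intervals, so `bin_edges` holds 21 values.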

In [12]:
import matplotlib.pyplot as plt

# Plot the histogram of the "Age" column

# Plot the histogram of the "Age" column with 20 bins

# Plot the histogram of the "Age" column with 20 bins and a red color

# Plot the histogram with labels "Age" and "Frequency" as x and y axis respectively

# Add a title to the plot

#### Exercise 11: Plotting a scatter plot

The objective of this exercise is to plot a scatter plot using Matplotlib, and to understand how parameters can change the graph.
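
The sketch below shows two common variations on synthetic data: coloring points by a value, and overlaying two series on the same axes. The data, the `viridis` colormap and the series labels are choices made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 1, size=50)
y = 3 * x + rng.normal(scale=0.2, size=50)   # noisy linear relation

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Color each point by its y value; the colorbar shows the mapping
points = left.scatter(x, y, c=y, cmap="viridis")
fig.colorbar(points, ax=left, label="y")

# Two scatter calls on the same axes overlay two series
right.scatter(x, y, color="red", label="series A")
right.scatter(x, y + 1, color="green", label="series B")
right.legend()
```

The `c` parameter takes one value per point and maps it through the colormap, while `color` applies a single color to every point.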

In [13]:
# Plot a scatter plot with the "Length" column as x axis and the "Age" column as y axis

# Plot a scatter plot with the "Whole weight" column as x axis and the "Age" column as y axis, and
# color the points by the "Age" column to see how the age evolves


# Plot a scatter plot with the "Length" column as x axis and the "Age" column as y axis, with red points
# On the same figure, plot the "Whole weight" column as x axis and the "Age" column as y axis, with green points


#### Exercise 12: Plotting a 3D scatter plot

The objective of this exercise is to know how to plot in 3D using Matplotlib.
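
As a hint, here is a minimal sketch of a 3D scatter plot on random points. The data and axis labels are made up, and it assumes a Matplotlib version recent enough to accept `projection="3d"` without an extra import.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)
x, y, z = rng.uniform(0, 1, size=(3, 40))   # three rows of 40 random values

fig = plt.figure()
ax = fig.add_subplot(projection="3d")       # creates 3D axes

# c colors each point by its x value
points = ax.scatter(x, y, z, c=x, cmap="viridis")
fig.colorbar(points, ax=ax, label="x")

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```

3D axes accept the same `c` and `cmap` parameters as 2D scatter plots and add `set_zlabel` for the third axis.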

In [None]:
# Plot a 3D scatter plot with the "Age" column as x axis, the "Whole weight" column as y axis and the "Length" column as z axis, with the points colored by "Age"


# Challenge

Three datasets are given below. The objective of this challenge is to find the best way to visualize the data in order to extract information from it. You can use everything you learned in this workshop, and you can also use the [Matplotlib website](https://matplotlib.org/) to find more information about the library.

#### Dataset 1: Marks Dataset

CSV link: https://www.kaggle.com/datasets/yasserh/student-marks-dataset/download?datasetVersionNumber=1

**Objective:** Find the best way to visualize the data in order to understand the distribution of the marks. The goal is to find the minimum study time needed to get a good mark.

> More info on the dataset: https://www.kaggle.com/datasets/yasserh/student-marks-dataset

#### Dataset 2: Alcohol Consumption Dataset

CSV link: https://www.kaggle.com/datasets/uciml/student-alcohol-consumption/download?datasetVersionNumber=2

**Objective:** Find the data that is the most correlated with alcohol consumption. The visualization should help in understanding the data.

> More info on the dataset: https://www.kaggle.com/datasets/uciml/student-alcohol-consumption

#### Dataset 3: Titanic Dataset 

CSV link: https://www.kaggle.com/datasets/brendan45774/test-file/download?datasetVersionNumber=6

**Objective:** Find the best way to classify whether a passenger survived or not based on the other columns of the dataset. The visualization should help in understanding the data.

> More info on the dataset: https://www.kaggle.com/datasets/brendan45774/test-file


# Conclusion

In this workshop, we covered the basics of NumPy, Pandas and Matplotlib. We learned how to use them to manipulate data and to visualize it. We also learned how to combine them to extract information from data. This is the basis of data science, and it is the foundation for using other libraries such as Scikit-learn, which we will cover in future workshops.

# To go further

Matplotlib is a huge library, and we only covered a small portion of it. That portion is the most important one, but many other kinds of graphs can be plotted using Matplotlib. You can find more information about it on the [Matplotlib website](https://matplotlib.org/).
