<a href="https://colab.research.google.com/github/yotam-biu/tutorial8/blob/main/data_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dependencies

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load the data and take a first look at it

#### Penguins data

## 1.  
Read the CSV file from the following link into a DataFrame:  
   [Palmer Penguins CSV](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv)  
   
Use the `pandas` library's `read_csv` function to load the data.  

Inspect the first few rows of the DataFrame to understand its structure and content.  

**Hint:** Remember to pass the URL directly to the function.


In [3]:
file_path = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv'
penguins = pd.read_csv(file_path)
data = penguins.copy()  # the exercises below refer to the DataFrame as `data`
penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009
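
A quick way to get a first impression of a freshly loaded DataFrame is sketched below. To keep the example runnable without network access, it parses five rows copied from the output above instead of the full file; the variable names `csv_text` and `sample` are chosen only for this illustration.

```python
import io

import pandas as pd

# Five rows copied from the output above (index column omitted)
csv_text = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(3))       # first three rows
print(sample.shape)         # (number of rows, number of columns)
print(sample.isna().sum())  # number of missing values per column
```

The same three calls work unchanged on the full DataFrame loaded from the URL.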


## Retrieving data
Retrieving data means accessing and extracting specific subsets of a dataset, which is the starting point for most analyses. Typical retrieval operations include filtering rows, selecting columns, sorting, grouping, and aggregating.
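
As a rough illustration of these operations, the sketch below applies each of them to a small hand-made DataFrame. The values are invented for the example and are not taken from the penguins file; the name `toy` is hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only (not real penguin measurements)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "sex": ["male", "female", "female", "male", "female"],
    "body_mass_g": [3700.0, 3400.0, 4900.0, 5500.0, 3600.0],
})

# Filtering rows: keep only rows that satisfy a boolean condition
heavy = toy[toy["body_mass_g"] > 3650]

# Selecting columns: pass a list of column names
subset = toy[["species", "body_mass_g"]]

# Sorting: order rows by a column, largest value first
by_mass = toy.sort_values("body_mass_g", ascending=False)

# Grouping and aggregating: mean body mass per species
mean_mass = toy.groupby("species")["body_mass_g"].mean()

print(heavy)
print(by_mass.head(2))
print(mean_mass)
```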


## 2. Filter

Filter the DataFrame to include only rows where the value in the `sex` column is "female".  

Assign the filtered DataFrame to a new variable named `data_female`.  

**Hint:** Use the syntax `data[condition]` to filter rows based on a condition.  


In [4]:
data_female = penguins[penguins['sex'] == 'female']
print(data_female)

       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
1       Adelie  Torgersen            39.5           17.4              186.0   
2       Adelie  Torgersen            40.3           18.0              195.0   
4       Adelie  Torgersen            36.7           19.3              193.0   
6       Adelie  Torgersen            38.9           17.8              181.0   
12      Adelie  Torgersen            41.1           17.6              182.0   
..         ...        ...             ...            ...                ...   
335  Chinstrap      Dream            45.6           19.4              194.0   
337  Chinstrap      Dream            46.8           16.5              189.0   
338  Chinstrap      Dream            45.7           17.0              195.0   
340  Chinstrap      Dream            43.5           18.1              202.0   
343  Chinstrap      Dream            50.2           18.7              198.0   

     body_mass_g     sex  year  
1         3800.0  

## Cleaning the data

Cleaning data is an essential step in the data science process. It involves identifying and handling missing values, handling duplicates, addressing inconsistencies or errors in the data, and transforming the data into a consistent and usable format.

## 3. Remove Missing Values

Use the DataFrame's `dropna()` method to remove all rows with missing values.  
Before applying the method, check the number of rows in the DataFrame using `len(data)` to understand the size of the data.  
Apply `dropna()` and assign the result back to the `data` variable.  
After applying the method, check the number of rows again using `len(data)` to confirm how many rows were removed.  
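
The steps above could look roughly like this. The sketch uses a four-row stand-in for `data` with invented values so that it runs on its own; on the real DataFrame only the three lines around `dropna()` are needed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; values invented for illustration
data = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 5000.0, 3800.0],
    "sex": ["male", None, "female", None],
})

rows_before = len(data)
data = data.dropna()  # drop every row that has at least one missing value
rows_after = len(data)

print(f"rows before: {rows_before}, after: {rows_after}, removed: {rows_before - rows_after}")
```

By default `dropna()` discards a row if any column is missing; its `subset` argument can restrict the check to specific columns when that is too aggressive.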



## 4. Remove Duplicates
Use the DataFrame's `drop_duplicates()` method to remove duplicate rows.  
Before applying the method, check the number of rows in the DataFrame using `len(data)` to see the current size of the data.  
Then apply `drop_duplicates()` and assign the result back to the `data` variable.  
After applying the method, check the number of rows again using `len(data)` to see how many duplicate rows were removed.  
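
A minimal sketch of this step, again on an invented stand-in for `data` in which the first two rows are identical:

```python
import pandas as pd

# Toy stand-in for `data`; rows 0 and 1 are exact duplicates (invented values)
data = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "body_mass_g": [3750.0, 3750.0, 5000.0, 5200.0],
})

rows_before = len(data)
data = data.drop_duplicates()  # keeps the first occurrence of each fully identical row
rows_after = len(data)

print(f"rows before: {rows_before}, after: {rows_after}, removed: {rows_before - rows_after}")
```

Rows count as duplicates only when every column matches, so the two Gentoo rows with different body masses are both kept.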


## 5. Filter By Z-Score
Calculate the z-scores for the `body_mass_g` column to identify how far each value deviates from the column's mean in terms of standard deviations.  

Use the formula for z-score:  
$$
   z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}
$$  

Set a threshold of 2.5 to determine which values are considered outliers (i.e., values whose absolute z-score is greater than 2.5).  

Check the `body_mass_g` values of the outliers to understand their magnitudes.


Filter the DataFrame to keep only the rows where the absolute `body_mass_g` z-score does not exceed the threshold, so that the outliers are removed.  

Again, check the number of rows before and after the filter.

**Hint:** Use `np.abs()` (after `import numpy as np`) to make all z-scores non-negative, and apply a condition to filter rows from the DataFrame.
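
One possible way to carry out the z-score steps is sketched below. It runs on twelve invented body-mass values, one of which is deliberately extreme, and shows both the rows flagged as outliers and the DataFrame that remains once they are dropped. The names `z_scores`, `threshold` and `is_outlier` are illustrative choices, and the sample standard deviation (pandas' default, `ddof=1`) is an assumption; the exercise does not say which variant to use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`: eleven ordinary values and one extreme value (invented)
data = pd.DataFrame({
    "body_mass_g": [3800.0, 3900.0, 4000.0, 4100.0, 4200.0, 3900.0,
                    4000.0, 4100.0, 4000.0, 3950.0, 4050.0, 9000.0],
})

mass = data["body_mass_g"]
z_scores = (mass - mass.mean()) / mass.std()  # Series.std() uses ddof=1 by default

threshold = 2.5
is_outlier = np.abs(z_scores) > threshold

# Inspect the flagged values before removing anything
outliers = data.loc[is_outlier, "body_mass_g"]
print(outliers)

rows_before = len(data)
data = data[~is_outlier]  # keep rows whose |z| is within the threshold
rows_after = len(data)

print(f"rows before: {rows_before}, after: {rows_after}")
```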


## EDA

Exploratory Data Analysis (EDA) is a critical step in the data science process. It involves understanding and analyzing the data to uncover patterns, relationships, and insights. EDA helps in formulating hypotheses, identifying data quality issues, selecting appropriate modeling techniques, and preparing the data for further analysis.

In [None]:
data['species'].unique()

In [None]:
data.hist(bins=20);
plt.tight_layout();

In [None]:
data.describe()

The `groupby()` method:

In [None]:
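# numeric_only=True skips text columns such as island and sex (required in pandas 2.0+)
data.groupby('species').mean(numeric_only=True)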

In [None]:
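# Aggregate numeric columns only; text columns cannot be averaged in pandas 2.0+
data.select_dtypes('number').groupby(data['species']).agg(['mean', 'median'])  # passing a list of recognized strings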

## 6.

* Import the necessary libraries: `matplotlib.pyplot` or `seaborn`.  

* Assume you have a DataFrame named `data` that contains a column named `species` and another numerical column, `flipper_length_mm`.  

* Group the data by the `species` column using the `groupby()` method.  

* For each species, create a histogram of the `flipper_length_mm` column. Use a different color for each species and set the `alpha` parameter to make the bars semi-transparent.  

* Add appropriate labels (`xlabel`, `ylabel`) and a title to the plot to describe the data.  

* Include a legend to identify the species represented by each histogram.  

* Display the histogram using `plt.show()`.  

**Hint:** Use a loop to iterate through each group returned by `groupby()`. The loop will provide the species name and the data subset for that species.


In [None]:
grouped = data.groupby('species')
for species, species_data in grouped:
    print(species, len(species_data))

In [None]:
import matplotlib.pyplot as plt

# Group the data by species

# Plot a histogram for each species with different colors
for species, species_data in grouped:
    pass # replace pass with the loop content

# Add labels and title to the plot

plt.legend()

# Display the plot
plt.show()
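
For reference, one way the loop body might be filled in is sketched here. To stay self-contained it replaces `data` with randomly generated flipper lengths (not real measurements); the bin count, the transparency value and the label texts are arbitrary choices for the illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for `data`: flipper lengths drawn at random, NOT real measurements
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 50),
    "flipper_length_mm": np.concatenate([
        rng.normal(190, 6, 50),
        rng.normal(196, 7, 50),
        rng.normal(217, 6, 50),
    ]),
})

fig, ax = plt.subplots(figsize=(8, 5))

# One semi-transparent histogram per species; each call takes the next color in the cycle
for species, species_data in data.groupby("species"):
    ax.hist(species_data["flipper_length_mm"], bins=15, alpha=0.5, label=species)

# Add labels, title and legend
ax.set_xlabel("Flipper length (mm)")
ax.set_ylabel("Count")
ax.set_title("Flipper length by species")
ax.legend()

plt.show()
```

Because every histogram is drawn on the same axes with `alpha=0.5`, overlapping regions stay visible instead of being hidden by the last species plotted.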


In [None]:

import seaborn as sns

sns.set_context("paper", font_scale=2)
sns.relplot(
    data = data,
    x = 'bill_length_mm',
    y = 'flipper_length_mm',
    hue = 'species',
    height=8,
    hue_order = ['Adelie', 'Gentoo', 'Chinstrap']);

## 7.

* Explore the relationship between different columns in the `data` DataFrame by creating your own scatter plot using `sns.relplot()`.  

* Choose different columns for the `x` and `y` axes to visualize a new aspect of the dataset.  

* Use the `hue` parameter to color the points by a categorical column, such as `species` or another relevant column.  

* Customize the plot by setting attributes like `height` or `hue_order` as needed.  

* Experiment with the `style` or `size` parameters in `sns.relplot()` to add more dimensions to your plot.  

* Display your plot and ensure it is clear and well-labeled.  

**Hint:** Refer to the provided example for guidance on using `sns.relplot()`. Consider exploring combinations of columns like `body_mass_g` and `bill_depth_mm`.


Go Over the Following Plots

In [None]:
sns.set_context('talk')
sns.pairplot(data.drop(columns = ['year']), hue='species');

In [None]:
sns.boxplot(data=data, hue = 'species', x='species', y='flipper_length_mm')

In [None]:
columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

fig, axs = plt.subplots(1, len(columns), figsize=(20,5))

for i, col in enumerate(columns):
    sns.boxplot(data=data, x='species', y=col, hue='species', ax = axs[i])
    axs[i].legend([], [], frameon=False)
    axs[i].set_xlabel(None)
    axs[i].set_ylabel(None)
    axs[i].set_title(col, fontsize = 20)

After performing Exploratory Data Analysis (EDA), there are several subsequent stages in the data science process. Here's a brief summary of some common stages that typically follow EDA:

* **Feature Engineering**: This stage involves transforming and creating new features from the existing data to improve the performance of machine learning models. Feature engineering can include techniques such as scaling, normalization, one-hot encoding, handling missing values, creating interaction terms, and deriving new features based on domain knowledge.

* **Model Selection**: In this stage, various machine learning algorithms or models are evaluated and compared to select the most suitable one for the given problem. The choice of the model depends on the nature of the data, the target variable, the available computational resources, and the desired performance metrics.

* **Model Training and Evaluation**: Once the model is selected, it needs to be trained on the labeled data (training set). This involves fitting the model to the data, adjusting its parameters to optimize performance. The trained model is then evaluated using appropriate metrics such as accuracy, precision, recall, F1-score, or area under the ROC curve.

* **Hyperparameter Tuning**: Many machine learning algorithms have hyperparameters that control the model's behavior and performance. Hyperparameter tuning involves selecting the optimal combination of hyperparameters to improve the model's performance. Techniques like grid search, random search, or Bayesian optimization can be used for this purpose.

* **Model Validation and Testing**: After training and tuning the model, it needs to be validated and tested on unseen data to assess its generalization capabilities. The model is evaluated using a separate validation set or through cross-validation techniques. The final model's performance is assessed on a completely independent test set to estimate its real-world performance.

* **Model Deployment**: Once the model is trained, validated, and tested, it can be deployed for real-world use. This involves integrating the model into a production environment, making predictions on new data, and monitoring its performance over time. Deployment may involve the use of frameworks, APIs, or cloud services to enable model serving and inference.

* **Monitoring and Maintenance**: After deployment, it's important to monitor the model's performance and behavior in production. Regular maintenance and retraining may be necessary to keep the model up to date and ensure its continued accuracy and reliability.

These stages are not necessarily linear and may involve iterations and feedback loops. The exact sequence and scope of these stages may vary depending on the specific problem, data, and requirements of the project.
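
To make the feature engineering stage slightly more concrete, here is a minimal pandas-only sketch of two of the techniques named above, scaling and one-hot encoding. The data are invented, the names `toy` and `features` are hypothetical, and a real project might prefer a dedicated preprocessing library.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Gentoo"],
    "body_mass_g": [3500.0, 5000.0, 3700.0, 5400.0],
})

# Scaling: standardize a numeric column to zero mean and unit (sample) standard deviation
mass = toy["body_mass_g"]
toy["body_mass_scaled"] = (mass - mass.mean()) / mass.std()

# One-hot encoding: replace the categorical column with one indicator column per category
features = pd.get_dummies(toy, columns=["species"])

print(features.columns.tolist())
```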