#Day 1: Introduction to Machine Learning
In this session, we'll start with an overview of machine learning and its applications, followed by an introduction to the three main types of machine learning, with examples of each: supervised, unsupervised, and reinforcement learning.

We'll then cover evaluation metrics: accuracy, precision, recall, and F1 score. Finally, we will work on a hands-on exercise by implementing a basic supervised model in Python.

#An overview of machine learning and its applications:
Machine learning is a subfield of artificial intelligence that involves building models that can learn from data and then use that learning to make predictions or decisions. In the most common setting, we start with a dataset that includes some inputs and corresponding outputs. The goal is to use this dataset to train a model that can take new inputs and predict the corresponding outputs.

There are several types of machine learning algorithms, including:

- Supervised learning: This involves learning from labeled data, where the inputs and corresponding outputs are provided. Examples of supervised learning algorithms include linear regression, logistic regression, decision trees, and neural networks.

- Unsupervised learning: This involves learning from unlabeled data, where only the inputs are provided. Examples of unsupervised learning algorithms include k-means clustering, hierarchical clustering, and principal component analysis (PCA).

- Reinforcement learning: This involves learning from trial and error, where the model takes actions and receives feedback in the form of rewards or penalties. Examples of reinforcement learning algorithms include Q-learning and policy gradient methods.



Machine learning has a wide range of applications across many different fields, including:

- Natural language processing: Machine learning can be used to build models that can understand and generate natural language. Applications include language translation, sentiment analysis, and chatbots.

- Computer vision: Machine learning can be used to build models that can recognize and interpret images and video. Applications include facial recognition, object detection, and self-driving cars.

- Healthcare: Machine learning can be used to build models that can predict disease progression, identify risk factors for certain conditions, and recommend treatment plans.

- Finance: Machine learning can be used to build models that can predict stock prices, detect fraud, and assess credit risk.

- Marketing: Machine learning can be used to build models that can personalize recommendations, predict customer behavior, and optimize advertising campaigns.

Overall, machine learning has the potential to revolutionize many different industries by enabling more accurate predictions and more efficient decision-making.

#Supervised Learning Techniques

Supervised learning is a type of machine learning that involves learning from labeled data. In supervised learning, we start with a dataset that includes both inputs and corresponding outputs. The goal is to use this data to train a model that can take new inputs and predict the corresponding outputs.

For example, let's say we want to build a model that can predict housing prices based on features such as the number of bedrooms, the square footage, and the location. We might start by collecting a dataset that includes the features of many different houses, as well as their corresponding sale prices. We would then use this data to train a supervised learning model, such as linear regression or a decision tree, that can take the features of a new house and predict its sale price.

Supervised learning can be divided into two main categories:

1.   Regression: This involves predicting a continuous output, such as the sale price of a house.
2.   Classification: This involves predicting a categorical output, such as whether a customer will buy a product or not.




******************************************************************************

#*Supervised Learning - Regression*


Supervised learning regression is a machine learning technique used to predict **continuous numerical values** based on input features, such as predicting the price of a house based on its features. As mentioned earlier, in this type of learning, the algorithm learns from a labeled dataset, where each data point consists of input features and corresponding target values.

The objective of supervised learning regression is to build a model that can generalize patterns in the data and make accurate predictions for unseen examples. The model learns the relationship between the input features and the target variable by fitting a mathematical function to the training data.

Key concepts in supervised learning regression:

1. Input Features: These are the variables or attributes that are used to predict the target variable. 

2. Target Variable: Also known as the dependent variable, it represents the variable that we want to predict using the input features. In regression, the target variable is continuous.

3. Training Data: This is the labeled dataset used to train the regression model.

4. Model Training: During the training phase, the algorithm learns the relationship between the input features and the target variable. It optimizes the model's parameters to minimize the difference between the predicted values and the actual target values.

5. Prediction: After the model is trained, it can make predictions on new, unseen examples. It takes the input features and generates a continuous output value as the prediction.

6. Evaluation: The performance of a regression model is evaluated using various metrics such as mean squared error (MSE), mean absolute error (MAE), or R-squared. These metrics measure the accuracy and goodness of fit of the model's predictions.

Common algorithms used in supervised learning regression include linear regression, polynomial regression, support vector regression, decision tree regression, random forest regression, and neural network regression.

Supervised learning regression is widely applied in various domains such as finance, economics, healthcare, and weather prediction. It can be used to predict housing prices, stock market trends, sales forecasts, patient outcomes, and many other continuous numerical variables.
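
The key concepts above (training data, model training, prediction, evaluation) can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is installed; the data is simulated with `make_regression`, and the variable names and the 25% hold-out split are choices made for this example only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Simulated labeled data: 200 observations, 3 input features, continuous target
features, target = make_regression(n_samples=200, n_features=3, n_informative=3,
                                   noise=5.0, random_state=1)

# Hold out 25% of the observations to stand in for "new, unseen examples"
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=1)

# Model training: fit a linear function to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction: one continuous value per held-out observation
predictions = model.predict(X_test)

# Evaluation: compare predictions with the actual target values
mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MSE: {mse:.2f}  MAE: {mae:.2f}  R-squared: {r2:.3f}")
```

Lower MSE and MAE indicate predictions closer to the actual values, while an R-squared closer to 1 indicates a better fit.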


# *Supervised Learning - Classification*

Supervised learning classification is a machine learning technique used to predict categorical or discrete class labels based on input features. Similar to regression, the classification algorithm learns from a labeled dataset.

The objective of supervised learning classification is to build a model that can classify or categorize new instances into predefined classes based on the patterns it has learned from the training data.

Key concepts in supervised learning classification:

1. Input Features: These are the variables or attributes that are used to predict the class labels. 

2. Class Labels: Also known as the target variable or the dependent variable, they represent the categories or classes that we want to predict using the input features. Class labels can be binary (two classes) or multi-class (more than two classes).

3. Training Data: This is the labeled dataset used to train the classification model. It consists of input features and their corresponding class labels.

4. Model Training: During the training phase, the algorithm learns the relationship between the input features and the class labels. It optimizes the model's parameters to minimize the classification errors and maximize the accuracy of predictions.

5. Prediction: After the model is trained, it can make predictions on new, unseen examples. Given the input features, the model assigns a class label to each instance based on the patterns it has learned from the training data.

6. Evaluation: The performance of a classification model is evaluated using various metrics such as accuracy, precision, recall, F1 score, and confusion matrix. These metrics measure the model's ability to correctly classify instances and assess the quality of its predictions.

Common algorithms used in supervised learning classification include Logistic Regression, Decision Trees, K-Nearest Neighbors, Support Vector Machines (SVM), and Naive Bayes.

Supervised learning classification finds applications in various domains such as spam email detection, sentiment analysis, image recognition, fraud detection, medical diagnosis, and many other tasks where categorical predictions are required.
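
As a rough sketch of the same workflow for classification, the example below trains a logistic regression classifier on simulated binary data. It assumes scikit-learn is installed; the dataset, the variable names, and the stratified 25% hold-out split are illustrative choices, not part of the exercises.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated labeled data: 200 observations, 3 input features, 2 class labels (0 and 1)
features, target = make_classification(n_samples=200, n_features=3, n_informative=3,
                                       n_redundant=0, n_classes=2, random_state=1)

# Hold out 25% of the observations, keeping the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=1, stratify=target)

# Model training: learn the relationship between features and class labels
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Prediction: a class label, and a probability per class, for each held-out observation
predicted_labels = classifier.predict(X_test)
predicted_probabilities = classifier.predict_proba(X_test)

# Evaluation: proportion of held-out observations classified correctly
accuracy = classifier.score(X_test, y_test)
print(f"Accuracy on held-out data: {accuracy:.2f}")
```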


##Evaluation Metrics

In machine learning, performance metrics quantify how well a model or algorithm performs on a dataset. They make it possible to compare models and to see which kinds of errors a model makes.

There are several common performance measurements used in machine learning, including:

1. Accuracy: The proportion of correct predictions out of all the predictions made by the model. Accuracy can be misleading when the dataset is imbalanced, i.e., when one class has far fewer observations than the others. In such cases, precision, recall, and F1 score are usually more informative.

2. Precision: The proportion of true positives out of all the positive predictions made by the model. Precision measures how precise the model is when predicting the positive class.

3. Recall: The proportion of true positives out of all the actual positive cases in the dataset. Recall measures how well the model can identify all the positive cases in the dataset.

4. F1 score: The harmonic mean of precision and recall, which combines both metrics into a single value.

5. ROC curve: A graph that shows the trade-off between the true positive rate and false positive rate of the model at different classification thresholds.

6. Confusion matrix: A table that summarizes the predicted and actual classes for a model, which can be used to calculate metrics such as accuracy, precision, recall, and F1 score.

These performance measurements are important for evaluating the effectiveness of machine learning models and choosing the best model for a given task. They can also be used to identify areas for improvement in the model and guide the development of new models.
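
To make the definitions above concrete, the following sketch computes accuracy, precision, recall, and F1 score by hand from the four cells of a binary confusion matrix. The ten labels are made up for illustration, and only plain Python is used.

```python
# Hypothetical labels for ten observations (1 = positive class, 0 = negative class)
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(actual, predicted))

# The four cells of the binary confusion matrix
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)                   # correct / all predictions
precision = tp / (tp + fp)                          # correct among predicted positives
recall = tp / (tp + fn)                             # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these labels the model scores 0.80 accuracy but only 0.75 precision and recall, because most of its correct predictions come from the larger negative class.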




#Hands-On Exercises

The exercises are from the textbook: [Machine Learning with Python Cookbook](https://www.oreilly.com/library/view/machine-learning-with/9781491989371/)

##Problem 1.
##1.1 Loading sample dataset
We will use `scikit-learn`, an open-source machine learning library for Python, to load a sample dataset.

In [None]:
# TODO: Load scikit-learn's datasets


# TODO: Load digits dataset


# TODO: Create features matrix


# TODO: Create target vector
 

# TODO: View first observation



Feel free to explore and load some other datasets available in the scikit-learn library. These datasets are commonly referred to as "toy" datasets because they are much smaller and cleaner than real-world datasets. Some popular sample datasets in scikit-learn are:

**load_boston**

  Contains 506 observations on Boston housing prices. It is a good dataset for exploring regression algorithms. Note that `load_boston` was removed in scikit-learn 1.2; on recent versions, `load_diabetes` or `fetch_california_housing` can be used for regression instead.

**load_iris**

  Contains 150 observations on the measurements of Iris flowers. It is a good dataset for exploring classification algorithms.

**load_digits**

  Contains 1,797 observations from images of handwritten digits. It is a good dataset for teaching image classification.

##1.2 Creating a simulated dataset

Suppose you need to create a dataset of simulated data. `scikit-learn` offers many methods for creating simulated data. Of those, we discuss three methods that are particularly useful: `make_regression`, `make_classification`, and `make_blobs`.

If you are looking for a dataset specifically designed for linear regression, the `make_regression` function is a suitable option.



The `make_regression` function in scikit-learn has several input parameters that allow you to customize the generated dataset. Here are the main parameters:

- `n_samples`: The number of samples in the dataset (default is 100).
- `n_features`: The number of features (independent variables) in the dataset (default is 100).
- `n_targets`: The number of target variables (dependent variables) in the dataset (default is 1).
- `bias`: The bias term in the underlying linear model (default is 0.0).
- `noise`: The standard deviation of the Gaussian noise added to the output variables (default is 0.0).  Higher values of noise will introduce more randomness and variability while lower values will result in a more deterministic relationship between the features and the target.
- `coef`: A boolean. If `True`, the coefficients of the underlying linear model are returned along with the features and target (default is False).
- `random_state`: The seed used by the random number generator for reproducibility (default is None).
- `n_informative`: The number of informative features, i.e., the features actually used to build the linear model that generates the output (default is 10).

These parameters allow you to control the characteristics of the generated dataset, such as its size, noise level, and coefficient values.

In [None]:
# TODO: load make_regression function from sklearn.datasets library


# Generate features matrix, target vector, and the true coefficients
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 3,
                                                 n_informative = 3,
                                                 n_targets = 1,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)

# TODO: View feature matrix and target vector


If your goal is to generate a synthetic dataset for classification purposes, you can use the `make_classification` function.

In [None]:
# TODO: Load make_classification function from sklearn.datasets library


# Generate features matrix and target vector
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,      # number of redundant features (random linear combinations of the informative ones)
                                       n_classes = 2,
                                       weights = [.25, .75], # proportion of samples assigned to each class;
                                                             # controls how balanced the classes are
                                       random_state = 1)     # seed for the random number generator;
                                                             # the same value reproduces the same dataset on every run

# TODO: View feature matrix and target vector


Finally, the `make_blobs` function generates data that is useful for trying out clustering methods.

In [None]:
# TODO: Load make_blobs from sklearn.datasets library


# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5, #determines the standard deviation of each cluster. It controls the spread or dispersion of the generated blobs. 
                                                  #A higher value of cluster_std results in clusters with greater spread, while a lower value creates more compact clusters.
                              shuffle = True,
                              random_state = 1)

# TODO: View feature matrix and target vector


For make_blobs, the centers parameter determines the number of clusters generated. Using the matplotlib visualization library, we can visualize the clusters generated by make_blobs:

In [None]:
# TODO: Load matplotlib.pyplot library


# View scatterplot
plt.scatter(features[:,0], features[:,1], c=target)

plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')

# TODO: show plot


See Also

• [make_regression documentation](http://bit.ly/2FtIBwo)

• [make_classification documentation ](http://bit.ly/2FtIKzW)

• [make_blobs documentation](http://bit.ly/2FqKMAZ)

##1.3 Loading a `CSV` file

Suppose you need to read a `CSV` (Comma-Separated Values) file, either from your local machine or from a URL. We use the `read_csv` function from `pandas`.

In [None]:
# TODO: Load pandas library


# TODO: Load dataset
 

# TODO: View first two rows


In [None]:
# Create URL
url = 'https://bit.ly/3oZmLJZ-titanic-csv' 

# TODO: Load dataset


# TODO: View first two rows


##1.4 Loading an Excel file

Use the `read_excel` function from pandas to load an Excel spreadsheet.

In [None]:
# TODO: Load data from your local machine


# TODO: View the first two rows


Working with Excel files is similar to our solution for reading CSV files. The main difference is the additional parameter, `sheet_name`, that specifies which sheet in the Excel file we wish to load. `sheet_name` accepts both strings containing the name of the sheet and integers pointing to sheet positions (zero-indexed). If we need to load multiple sheets, include them as a list. For example, `sheet_name=[0, 1, 2, "Monthly Sales"]` will return a dictionary of pandas DataFrames containing the first, second, and third sheets and the sheet named Monthly Sales. (Older pandas versions spelled this parameter `sheetname`.)

##1.5 Loading a JSON file

If you need to load a JSON file for data preprocessing, the `read_json` function from the pandas library converts a JSON file into a pandas object.

See Also

• [json_normalize documentation](http://bit.ly/2HQqwaa)

In [None]:
# TODO: Load data from your local machine
dataframe = pd.read_json('file path', orient='columns')
    
# TODO: View the first two rows
dataframe.head(2)

Importing JSON files into pandas is similar to the last few recipes we have seen. The key difference is the `orient` parameter, which indicates to pandas how the JSON file is structured. However, it might take some experimenting to figure out which argument (`split`, `records`, `index`, `columns`, or `values`) is the right one. Another helpful tool pandas offers is `json_normalize`, which can help convert semistructured JSON data into a pandas DataFrame.
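
The effect of `orient` is easier to see on a tiny example. The sketch below is illustrative only: the JSON strings and names are made up, and they are read from in-memory strings (via `StringIO`) rather than from files so that the example is self-contained.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON in "records" orientation: a list with one object per row
records_json = '[{"name": "Ann", "age": 38}, {"name": "Bob", "age": 25}]'
dataframe_records = pd.read_json(StringIO(records_json), orient='records')

# The same data in "columns" orientation: {column -> {row label -> value}}
columns_json = '{"name": {"0": "Ann", "1": "Bob"}, "age": {"0": 38, "1": 25}}'
dataframe_columns = pd.read_json(StringIO(columns_json), orient='columns')

# Nested (semistructured) records can be flattened with json_normalize;
# nested keys become dotted column names such as "address.city"
nested_records = [
    {"name": "Ann", "address": {"city": "Oslo", "country": "Norway"}},
    {"name": "Bob", "address": {"city": "Lima", "country": "Peru"}},
]
dataframe_flat = pd.json_normalize(nested_records)

print(dataframe_records)
print(dataframe_columns)
print(dataframe_flat)
```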

##Problem 2
##Data Wrangling
Data wrangling is a broad concept that refers to the process of transforming raw data into a structured and organized format suitable for analysis. In our case, data wrangling is just one step in the data preprocessing phase, but it holds significant importance.

The primary tool commonly used for data wrangling is the data frame, which is a versatile and intuitive data structure. Data frames are tabular in nature, resembling rows and columns similar to a spreadsheet. They provide a convenient way to organize and manipulate data effectively.

Let's create a DataFrame from the Titanic dataset.

In [None]:
# TODO: Load pandas library


# TODO: Create a URL for titanic dataset


# TODO: Load data as a dataframe


# TODO: Show first 5 rows


##2.1 Creating a DataFrame from scratch
Pandas offers several methods for creating a new DataFrame object. One straightforward approach is to create an empty DataFrame using the DataFrame constructor and then define each column individually:

In [None]:
# TODO: Create an empty Dataframe


# TODO: Add three columns: Name, Age, and Driver. 
# The names to be added are: Jack Jackson, Steven Stevenson
# corresponding ages: 38 and 25
# driver status: True and False


# TODO: Show dataframe


Alternatively, after creating a DataFrame object, we have the option to add new rows to the bottom:

In [None]:
# Create row
new_person = pd.Series(['Molly Mooney', 40, True], index=['Name','Age','Driver'])

# TODO: Append row (note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)


##2.2 Describing the Data

View the first rows:

In [None]:
# TODO: Load data


# TODO: show first two rows


View number of the rows and columns:

In [None]:
# TODO: show dimensions


In [None]:
# TODO: Show statistics


##2.3 Navigating Dataframes

Suppose you need to select individual data or slices of the Dataframe. 

In this case, use `loc` (selection by index label) or `iloc` (selection by integer position) to select one or more rows or values.
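
The difference between the two is clearest when the index is not a plain run of integers. The following is a small sketch on a made-up DataFrame (the names, ages, and index labels are hypothetical):

```python
import pandas as pd

# Hypothetical data with string index labels instead of the default 0, 1, 2, ...
people = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cleo', 'Dev'],
                       'Age': [38, 25, 40, 31]},
                      index=['a', 'b', 'c', 'd'])

# iloc selects by integer position; the end of a slice is excluded
first_row = people.iloc[0]           # the row at position 0, returned as a Series
by_position = people.iloc[1:3]       # rows at positions 1 and 2

# loc selects by index label; the end of a slice is included
by_label = people.loc['b':'d']       # rows labeled 'b', 'c', and 'd'
single_value = people.loc['c', 'Age']  # one value: row 'c', column 'Age'

print(first_row)
print(by_position)
print(by_label)
print(single_value)
```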

In [None]:
# TODO: Select first row


In [None]:
# TODO: select three rows: 2nd to 4th


##2.4 Selecting Rows Based on Conditionals

Selecting and filtering data based on specific conditions is a frequent task in data wrangling. Rather than working with the entire raw dataset, we often focus on extracting a specific subset of the data that meets our criteria or requirements. For example, suppose we want to select all women on the Titanic.

In [None]:
# TODO: Load data


# TODO: Show top two rows where column 'Sex' is 'female'


Select all the rows where the passenger is a female 65 or older:

In [None]:
dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] >= 65)]

##2.5 Replacing values
If you want to replace a value in a DataFrame, use the `replace` method from pandas to find and replace values. For example, you can replace any instance of "female" in the Sex column with "Woman":

In [None]:
# TODO: Load data
dataframe = pd.read_csv(url)

# TODO: Replace values, show two rows
dataframe['Sex'].replace("female", "Woman").head(2)

We can also replace multiple values at the same time:

In [None]:
# TODO: Replace "female" and "male" with "Woman" and "Man"


We can also find and replace across the entire DataFrame object by specifying the whole data frame instead of a single column:

In [None]:
# TODO: Replace values, show two rows


`replace` also accepts regular expressions. 

Regular expressions can be used to match and extract specific patterns or sequences of characters within text data. They are widely used in programming, data processing, and text analysis tasks.
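
As a brief illustration, passing `regex=True` makes `replace` treat its first argument as a pattern and substitute every match inside each string. The names below are made up for this sketch and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical names with an honorific in front
names = pd.Series(['Mr. Owen', 'Mrs. Allen', 'Miss Kelly', 'Mr. Braund'])

# The pattern matches "Mr." or "Mrs." at the start of the string, plus any spaces after it;
# the matched text is replaced with an empty string, and non-matching values are left as-is
without_title = names.replace(r'^(Mr|Mrs)\.\s*', '', regex=True)

print(without_title.tolist())
```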

In [None]:
# TODO: Replace values, show two rows


##2.6 Finding the Minimum, Maximum, Sum, Average, and Count

pandas comes with some built-in methods for commonly used descriptive statistics:

In [None]:
# TODO: Load data


# Calculate statistics
# TODO: print the max of the age


# TODO: print the min of the age


# TODO: print the mean of age


# TODO: print the sum of ages


# TODO: print the number of age items


# TODO: Show Dataframe counts


## 2.7 Handling Missing Values

`isnull` and `notnull` return booleans indicating whether a value is missing:

In [None]:
#TODO: select missing values, using isnull(), show two rows
dataframe[dataframe['Age'].isnull()].head(2)



Dealing with missing values is a common challenge in data wrangling, and it can be more complex than anticipated. pandas represents missing values with NumPy's NaN ("Not a Number") value. However, `NaN` is not a built-in name in Python or pandas. For instance, if we attempt to replace every "male" value with a missing value using a bare `NaN`, Python raises a `NameError`.

In [None]:
# Attempt to replace values with NaN (raises NameError: name 'NaN' is not defined)
dataframe['Sex'] = dataframe['Sex'].replace('male', NaN)

To have full functionality with `NaN`, we need to import the `NumPy` library first and refer to the value as `np.nan`:

In [None]:
#TODO: Load numpy library


#TODO: Replace values with NaN


# TODO:show dataFrame
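
A short sketch of how `np.nan` behaves, using a made-up Series rather than the Titanic data. One detail worth knowing is that NaN never compares equal to anything, including itself, which is why `isnull` and `notnull` are used instead of `==`.

```python
import numpy as np
import pandas as pd

# Hypothetical column of values
sexes = pd.Series(['male', 'female', 'male'])

# Replace every "male" value with a missing value
with_missing = sexes.replace('male', np.nan)

# isnull flags the missing entries; notnull flags the present ones
missing_mask = with_missing.isnull()
present_count = with_missing.notnull().sum()

# NaN is not equal to itself, so an equality test cannot find missing values
nan_equals_itself = (np.nan == np.nan)

print(missing_mask.tolist(), present_count, nan_equals_itself)
```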


##2.8 Deleting columns and rows


To delete a column, the most effective approach is to use the `drop` method with the parameter `axis=1` (referring to the column axis).

In [None]:
# TODO: Delete Age column


You can also use a list of column names as the main argument to drop multiple columns at once:

In [None]:
#TODO: Drop Age and Sex columns


If a column does not have a name (which can sometimes happen), you can drop it by its column index using dataframe.columns:

In [None]:
#TODO:  Drop 1st column


**Discussion**


The recommended way to delete a column is the `drop` method. Another approach is `del dataframe['Age']`, which works most of the time but is not recommended because of how it is implemented within pandas (the details are beyond the scope of this course).

One suggestion provided by the [book's](https://www.oreilly.com/library/view/machine-learning-with/9781491989371/) author is to avoid using the `inplace=True` argument in pandas. Several pandas methods have an `inplace` parameter that, when set to `True`, modifies the DataFrame directly. This can create issues in complex data processing pipelines because it treats DataFrames as mutable objects (which they technically are). It is advisable to treat DataFrames as immutable objects. For example:
 


In [None]:
# Create a new DataFrame
dataframe_name_dropped = dataframe.drop(dataframe.columns[0], axis=1)

In the above example, we are not modifying the original DataFrame "dataframe" directly. Instead, we are creating a new DataFrame called "dataframe_name_dropped," which is a modified version of the original dataframe. By treating DataFrames as immutable objects and avoiding direct modifications, you can prevent potential complications and avoid future difficulties.
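
The point can be checked directly on a small made-up DataFrame: `drop` returns a new object and leaves the original untouched. The column names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical data
original = pd.DataFrame({'Name': ['Ann', 'Bob'],
                         'Age': [38, 25],
                         'Sex': ['female', 'male']})

# drop returns a new DataFrame without the Age column ...
without_age = original.drop('Age', axis=1)

# ... and the original still has all three columns
print(list(original.columns))     # ['Name', 'Age', 'Sex']
print(list(without_age.columns))  # ['Name', 'Sex']
```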

###Deleting a row

To delete one or more rows from a DataFrame, use a boolean condition to create a new DataFrame that excludes the rows you want to delete:

In [None]:
#TODO: Load data


#TODO: Delete rows, show first two rows of output


**Discussion**

Technically, you can use the `drop` method (e.g., `df.drop([0, 1], axis=0)` to drop the first two rows), but a more practical approach is to use a boolean condition within `df[]`. This method allows us to leverage the power of conditionals to delete either a single row or multiple rows at once, which is often more useful.
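
Both approaches are sketched below on a made-up three-row DataFrame (the names are hypothetical). Note that "deleting" with a condition really means keeping the rows for which the condition is true.

```python
import pandas as pd

# Hypothetical data
passengers = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cleo'],
                           'Sex': ['female', 'male', 'female']})

# Boolean condition: keep every row whose Sex is not 'male'
without_males = passengers[passengers['Sex'] != 'male']

# drop: remove rows by index label (here the row labeled 0)
without_first = passengers.drop([0], axis=0)

print(without_males)
print(without_first)
```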

##2.9 Looping Over a Column

If you want to iterate over every element in a column and apply some action, you can treat a pandas column like any other sequence in Python:

In [None]:
#TODO: Load data


#TODO: Print first two names uppercased using for loop


**Discussion**

In addition to loops (often called for loops), we can also use list comprehensions:

In [None]:
#TODO: Show first two names lowercased using list comprehension


##2.10 Applying a function over all elements in a column
If you want to apply some function over all elements in a column, use `apply` to apply a built-in or custom function on every element in a column:

In [None]:
#TODO: Load data


#TODO: Create function 


#TODO: Apply function, show two rows


##2.11 Concatenating DataFrames
If you want to concatenate two DataFrames, use `concat` with `axis=0` to concatenate along the rows axis:

In [None]:
# Create DataFrame
data_a = {'id': ['1', '2', '3'],
          'first': ['Alex', 'Amy', 'Allen'],
          'last': ['Anderson', 'Ackerman', 'Ali']}

dataframe_a = pd.DataFrame(data_a, columns = ['id', 'first', 'last'])

# Create DataFrame
data_b = {'id': ['4', '5', '6'],
          'first': ['Billy', 'Brian', 'Bran'],
          'last': ['Bonder', 'Black', 'Balwner']}

dataframe_b = pd.DataFrame(data_b, columns = ['id', 'first', 'last'])

#TODO: Concatenate DataFrames by rows


**Discussion**

Concatenating is a term commonly used in computer science and programming to describe the act of joining two objects together. In simpler terms, it means to combine or merge two objects. In the provided solution, we merged two smaller DataFrames by specifying the axis parameter, which determines whether the DataFrames are stacked vertically (on top of each other) or horizontally (side by side).
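
The effect of the axis parameter can be sketched on a few made-up rows. The small DataFrames below are illustrative and separate from the exercise data.

```python
import pandas as pd

# Hypothetical DataFrames
top = pd.DataFrame({'id': ['1', '2'], 'first': ['Alex', 'Amy']})
bottom = pd.DataFrame({'id': ['3'], 'first': ['Allen']})
extra_column = pd.DataFrame({'last': ['Anderson', 'Ackerman']})

# axis=0 stacks vertically: rows of `bottom` go under rows of `top`;
# ignore_index=True renumbers the rows 0, 1, 2 instead of repeating 0
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 places DataFrames side by side, aligning rows on the index
side_by_side = pd.concat([top, extra_column], axis=1)

print(stacked)
print(side_by_side)
```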

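Alternatively, we can add a single new row to a DataFrame. Older pandas versions provided `DataFrame.append` for this; it was removed in pandas 2.0, so on current versions wrap the row in a one-row DataFrame and use `pd.concat`: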

In [None]:
# Create row
row = pd.Series([10, 'Chris', 'Chillon'], index=['id', 'first', 'last']) 

#TODO: Append row


##2.12 Merging DataFrames

If you want to inner join two DataFrames, use `merge` with the `on` parameter to specify the column to merge on:

In [None]:
# Create DataFrame
employee_data = {'employee_id': ['1', '2', '3', '4'],
                 'name': ['Amy Jones', 'Allen Keys', 'Alice Bees',
                          'Tim Horton']}

dataframe_employees = pd.DataFrame(employee_data, columns = ['employee_id',
                                                             'name'])

dataframe_employees

In [None]:
# Create DataFrame
sales_data = {'employee_id': ['3', '4', '5', '6'],
              'total_sales': [23456, 2512, 2345, 1455]}

dataframe_sales = pd.DataFrame(sales_data, columns = ['employee_id',
                                                    'total_sales'])

dataframe_sales

In [None]:
#TODO: Merge DataFrames 'inner'


By default, the merge function performs inner joins. However, if we want to perform an outer join, we can specify it using the how parameter.

In [None]:
#TODO: Merge DataFrames 'outer'


Left or right join:

The same parameter can be used to specify left and right joins:

In [None]:
#TODO: Merge DataFrames 'left'
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='left')

In [None]:
#TODO: Merge DataFrames 'right'


You can also specify the column name in each DataFrame to merge on:

In [None]:
# Merge DataFrames
pd.merge(dataframe_employees,
         dataframe_sales,
         left_on='employee_id',
         right_on='employee_id')

|   | employee_id | name | total_sales |
|---|---|---|---|
| 0 | 3 | Alice Bees | 23456 |
| 1 | 4 | Tim Horton | 2512 |


**Discussion**

Frequently, the data we work with is complex and not available as a single entity. Instead, we often encounter various datasets from different sources, such as multiple database queries or files. To consolidate all the data into a unified structure, we can load each query or file into separate DataFrames in pandas and then merge them together to create a single DataFrame.

To perform a merge operation, there are three components that need to be specified. 

Firstly, we need to identify the two DataFrames that we want to merge together. In the exercises above, they were assigned the names "dataframe_employees" and "dataframe_sales".

Secondly, we need to specify the column(s) on which the merge will be based. These columns contain values that are shared between the two DataFrames. Both DataFrames have a column called "employee_id". The merge operation will pair up the values in the "employee_id" column of each DataFrame. If the column names are the same, the "`on`" parameter can be used. However, if the column names differ, we can use "left_on" and "right_on" to specify the corresponding column names from the left and right DataFrames.

What do we mean by the "left" and "right" DataFrames? In simple terms, the "left" DataFrame refers to the first DataFrame that we mention in the merge operation, while the "right" DataFrame is the second one. This terminology becomes relevant again when discussing the subsequent parameters.

The final aspect, which can be a bit challenging for some, is determining the type of merge operation we want to perform. This is indicated by the "how" parameter. The merge function supports the four primary types of joins:

Inner

Return only the rows that match in both DataFrames (e.g., return any row with an employee_id value appearing in both dataframe_employees and dataframe_sales).

Outer

Return all rows in both DataFrames. If a row exists in one DataFrame but not in the other DataFrame, fill NaN values for the missing values (e.g., return all rows in both dataframe_employees and dataframe_sales).

Left

Return all rows from the left DataFrame but only rows from the right DataFrame that matched with the left DataFrame. Fill NaN values for the missing values (e.g., return all rows from dataframe_employees but only rows from dataframe_sales that have a value for employee_id that appears in dataframe_employees).

Right

Return all rows from the right DataFrame but only rows from the left DataFrame that matched with the right DataFrame. Fill NaN values for the missing values (e.g., return all rows from dataframe_sales but only rows from dataframe_employees that have a value for employee_id that appears in dataframe_sales).
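
The four join types can be compared side by side by counting the rows each one returns. This sketch rebuilds the employee and sales data from the exercise under the shorter names `employees` and `sales` (chosen for this example), and uses `merge`'s `indicator=True` option to show where each row of the outer join came from.

```python
import pandas as pd

# Same values as the exercise data: employee ids 1-4 and sales for ids 3-6
employees = pd.DataFrame({'employee_id': ['1', '2', '3', '4'],
                          'name': ['Amy Jones', 'Allen Keys', 'Alice Bees', 'Tim Horton']})
sales = pd.DataFrame({'employee_id': ['3', '4', '5', '6'],
                      'total_sales': [23456, 2512, 2345, 1455]})

# Row count for each join type
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(employees, sales, on='employee_id', how=how)
    row_counts[how] = len(merged)

print(row_counts)

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
outer_join = pd.merge(employees, sales, on='employee_id', how='outer', indicator=True)
print(outer_join)
```

Only ids 3 and 4 appear in both DataFrames, so the inner join has 2 rows, the left and right joins have 4 rows each, and the outer join has 6.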