# Recap of Labs 1 and 2

# Lab 1: Locating and Exploring Datasets

Welcome to the first assignment of the course 'AI-powered Data Analysis'. In this assignment, you will learn how to locate and open datasets stored as CSV files, understand their structure, and explore their components.


## Introduction

We start with the very basics of data analysis: locating CSV files on disk, loading them, and inspecting their structure and components.

### Importance of Metadata

When given a dataset, the first step is to look at the metadata. Metadata provides crucial information about the data, such as:
- The structure of the dataset (e.g., column names, data types)
- Descriptions of the data fields
- Information about the source and context of the data

Understanding the metadata is beneficial because:
- It helps in comprehending the dataset's structure and content.
- It provides insights into the data quality and potential preprocessing steps needed.
- It aids in planning the data analysis and visualization strategies effectively.

### Datasets Overview

We will be using three datasets in this assignment:

1. **NOAA Weather Dataset**
    - This dataset contains weather data collected by the National Oceanic and Atmospheric Administration (NOAA).
    - [Link to Metadata](#)

2. **Kaggle Ecommerce Dataset**
    - This dataset comprises e-commerce data from Kaggle, including information about transactions, products, and customers.
    - [Link to Metadata](#)

3. **Yelp Reviews Dataset**
    - This dataset includes reviews from Yelp, with columns such as 'stars', 'useful', 'funny', 'cool', and 'text'.
    - [Link to Metadata](#)

After understanding the metadata and the structure of our datasets, we will proceed with coding to explore and analyze the data.


### Datasets Information

In this course, we have three different datasets housed within the following file directory structure:

As you can see, a dataset may contain multiple CSV files. For this assignment, we will be using the `bank_churners.csv` file in the `Kaggle_Ecommerce` folder.


## Introduction to Dataset Overview and Metadata

Before diving into data analysis, it is crucial to get an overview of the dataset and understand its metadata. This initial step provides valuable insights into the data's structure, quality, and the types of information it contains.

### Why Summarize the Dataset Overview?

- **Quick Insights:** A summary gives a quick glance at the main features of the dataset.
- **Data Quality:** Helps in identifying any immediate data quality issues.
- **Preparation for Analysis:** Prepares the ground for more detailed data analysis and visualization.

## 1. Locating and Opening CSV Files

First, let's locate and open the CSV files. We'll use the Kaggle Ecommerce Data for this exercise.

### List all files in the `Kaggle_Ecommerce` folder
We will use the `os` module to list all files in the folder.

1. **Import the `os` module**

```python
import os
```

The `os` module provides functions to interact with the operating system, such as listing files in a directory.

2. **List all files in the `Kaggle_Ecommerce` folder**

```python
files = os.listdir('../Datasets/Kaggle_Ecommerce')
files
```

- `os.listdir('../Datasets/Kaggle_Ecommerce')` lists all files and directories in the `Kaggle_Ecommerce` folder.
- The result is stored in the variable `files`.
- `files` is then displayed to show the list of files.

In [None]:
import os

# List all files in the folder
files = os.listdir('../Datasets/Kaggle_Ecommerce')
files

### Explanation of the above code cell
The above code cell imports the `os` module and lists all files in the `Kaggle_Ecommerce` folder. The `os.listdir()` function retrieves the names of all files and directories in the specified folder. The result is stored in the variable `files`, which is then displayed. This helps in identifying the available CSV files for further analysis.
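
`os.listdir()` returns names in arbitrary order and also includes subdirectories and non-CSV files. The following is a minimal sketch of narrowing the listing to CSV files in a stable order. The helper name `list_csv_files` is hypothetical, and a temporary folder stands in for the course's dataset directory:

```python
import os
import tempfile


def list_csv_files(folder):
    """Return the CSV file names in `folder`, sorted alphabetically."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith(".csv")
        and os.path.isfile(os.path.join(folder, name))
    )


# Demonstrate on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["b.csv", "a.csv", "notes.txt"]:
        open(os.path.join(demo_folder, name), "w").close()
    csv_names = list_csv_files(demo_folder)

print(csv_names)
```

Sorting the names makes an index such as `files[0]` refer to the same file on every run.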

### Read and display the first few rows of a CSV file
We will use the `pandas` library to read the CSV file and display its first few rows.

1. **Import the `pandas` library**

```python
import pandas as pd
```

The `pandas` library is used for data manipulation and analysis.

2. **Read the first CSV file**

```python
file_path = os.path.join('../Datasets/Kaggle_Ecommerce', files[0])
```

- `os.path.join('../Datasets/Kaggle_Ecommerce', files[0])` creates the full path to the first file in the `Kaggle_Ecommerce` folder.
- The result is stored in the variable `file_path`.

3. **Load the CSV file into a DataFrame**

```python
df = pd.read_csv(file_path)
```

- `pd.read_csv(file_path)` reads the CSV file and loads its contents into a `pandas` DataFrame.
- The DataFrame is stored in the variable `df`.

4. **Display the first few rows of the DataFrame**

```python
df.head()
```

- `df.head()` displays the first five rows of the DataFrame.

In [None]:
import pandas as pd

# Read the first file in the listing (expected to be bank_churners.csv; os.listdir order is not guaranteed)
file_path = os.path.join('../Datasets/Kaggle_Ecommerce', files[0])
df = pd.read_csv(file_path)
df.head()

The first five rows of the DataFrame are displayed using `df.head()`. This provides a quick overview of the data structure and the initial few records.


### Get the number of rows in the DataFrame

We will use the `len()` function to get the number of rows in the DataFrame.

1. **Get the number of rows**

```python
num_rows = len(df)
num_rows
```

In [None]:
num_rows = len(df)
num_rows

This number represents the total number of records in the dataset. This information is crucial for understanding the dataset's size.
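
`len(df)` counts rows only. As a small sketch on a made-up DataFrame (the column names are illustrative and not taken from the course data), `shape` reports rows and columns together:

```python
import pandas as pd

# Toy data standing in for a loaded CSV file.
toy = pd.DataFrame({"customer_id": [1, 2, 3], "balance": [10.5, 0.0, 7.25]})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
```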

## 2. Understanding the Structure

Let's get more information about the dataset to understand its structure better.

### Get the column names in the DataFrame

We will use the `.columns` attribute of the DataFrame to get the column names.

1. **Get the column names**

```python
columns = df.columns
columns
```

In [None]:
columns = df.columns
columns

The items inside the square brackets `[]` are the DataFrame's column names, so you can see all of them at a glance. This is especially helpful when dealing with large datasets or when you want to interact with the columns programmatically.

### Get basic information about the dataframe
We will use the `info()` method to get a concise summary of the dataframe, including the number of non-null values and data types of each column.

1. **Get a concise summary of the DataFrame**

```python
df.info()
```

- `df.info()` displays a concise summary of the DataFrame.
- It shows the number of non-null values, data types of each column, and memory usage.

In [None]:
df.info()

**A detailed breakdown of the above output is:**


**Class Type:**

- The first line `<class 'pandas.core.frame.DataFrame'>` tells you that the object is a DataFrame.

**Index Range:**

- `RangeIndex: 10127 entries, 0 to 10126` indicates that the DataFrame has an Index with 10127 entries ranging from 0 to 10126.

**Column Information:**

- `Data columns (total 21 columns):` indicates that there are 21 columns in the DataFrame.

**Column Details:**

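- For each column, you get the column number (starting from 0), the column name, the count of non-null values, and the data type.
- The non-null count is the number of entries that are not missing. A column has no missing values only when this count equals the total number of entries (10127 here).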

**Data Types Summary**

- `dtypes: float64(5), int64(10), object(6)` summarizes the data types present in the DataFrame and how many columns have each type. In pandas, `object` columns usually hold strings, although they can contain any Python object.

**Memory** 
- `memory usage: ... MB` indicates the memory usage of the DataFrame.
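
To make the non-null counts concrete, here is a minimal sketch on made-up data (the column names and values are illustrative). It compares each column's non-null count with the number of rows, which is the check a reader performs by eye on the `info()` output:

```python
import pandas as pd

toy = pd.DataFrame({"age": [34, None, 51], "city": ["Oslo", "Lima", "Pune"]})

toy.info()  # prints the kind of summary described above

# Per-column non-null counts, the same numbers info() reports.
non_null = toy.count()
# A column is complete only when its non-null count equals the row count.
complete = non_null == len(toy)
print(complete)
```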

## Data Preprocessing and Cleaning

## 1. Data Preprocessing and Cleaning

Data preprocessing and cleaning are critical steps in the data analysis pipeline. The quality of data directly impacts the quality of insights that can be derived from it. Preprocessing involves transforming raw data into a clean and usable format. Cleaning involves handling missing values, correcting errors, and preparing the data for analysis.

### Handling Missing Values
Missing values are common in datasets and can significantly affect the results of your analysis. Common strategies to handle missing values include:
- **Removal**: Removing rows or columns with missing values.
- **Imputation**: Filling missing values with a specific value such as the mean, median, or mode of the column.
- **Prediction**: Using machine learning models to predict missing values based on other features.
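
The first two strategies can be sketched on a small made-up DataFrame (the column names and values are illustrative, not from the course data). This is one possible way to apply them, not the only one:

```python
import pandas as pd

toy = pd.DataFrame({"price": [10.0, None, 30.0], "color": ["red", "blue", None]})

# Removal: drop every row that has at least one missing value.
dropped = toy.dropna()

# Imputation: mean for the numeric column, mode for the text column.
imputed = toy.copy()
imputed["price"] = imputed["price"].fillna(imputed["price"].mean())
imputed["color"] = imputed["color"].fillna(imputed["color"].mode()[0])

print(len(dropped))
print(imputed)
```

Removal is simple but discards data (two of the three rows here), while imputation keeps every row at the cost of introducing estimated values.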

Let's start by importing the required libraries, loading the `shopping_behavior.csv` file from the `Kaggle_Ecommerce` folder, and examining the data to identify errors.

In [None]:
# Load the dataset
import os
import pandas as pd
import numpy as np

file_path = '../Datasets/Kaggle_Ecommerce/shopping_behavior.csv'
shop_behav = pd.read_csv(file_path)
shop_behav.head()

### Handling Missing Values
Identify and handle missing values in the dataset.

This code checks for missing values in the `shop_behav` DataFrame.

- **shop_behav.isnull()**: Identifies all the null (missing) values in the DataFrame.
- **sum()**: Counts the total number of missing values in each column.

The result shows the number of missing values per column, helping us understand the extent of missing data in our dataset.

In [None]:
# Identify missing values
shop_behav.isnull().sum()

This code handles missing values in the `shop_behav` DataFrame by filling the missing values in each numerical column with that column's mean.

- **missing_cols**: Identifies columns with any missing values.
- **for col in missing_cols**: Iterates through each column with missing values.
    - **if shop_behav[col].dtype in [np.float64, np.int64]**: Checks if the column is numerical.
    - **shop_behav[col] = shop_behav[col].fillna(shop_behav[col].mean())**: Fills missing values in numerical columns with the column mean and assigns the result back to the column. In recent pandas versions, calling `fillna(..., inplace=True)` on a selected column triggers a chained-assignment warning and may not update the DataFrame, so assigning the result back is more reliable.

In [None]:
# Identify columns with missing values
missing_cols = shop_behav.columns[shop_behav.isnull().any()]

# Fill missing values in numerical columns with the column mean
for col in missing_cols:
    if shop_behav[col].dtype in [np.float64, np.int64]:
        # Assign the result back instead of using inplace=True on a column selection
        shop_behav[col] = shop_behav[col].fillna(shop_behav[col].mean())

We again check for missing values, and as can be seen, there are none left.

In [None]:
# Rechecking for missing values
shop_behav.isnull().sum()

### Removing Duplicates
Duplicate records can skew your analysis and lead to incorrect insights. Removing duplicates ensures that each record in your dataset is unique. This is typically done by identifying and removing rows that have identical values across all columns.

This code checks for duplicate rows in the `shop_behav` DataFrame.

- **shop_behav.duplicated()**: Identifies duplicate rows.
- **sum()**: Counts the total number of duplicate rows in the DataFrame.

In [None]:
# Identify duplicates
shop_behav.duplicated().sum()

This indicates that the dataset has one duplicate row. We will fix it now.

- **shop_behav.drop_duplicates(inplace=True)**: Removes duplicate rows from the DataFrame and updates `shop_behav` in place.

In [None]:
# Remove duplicates
shop_behav.drop_duplicates(inplace=True)
shop_behav.duplicated().sum()

The dataset now has no duplicates.
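
By default, `duplicated()` compares all columns. As a small sketch on made-up data (the column names are illustrative), the `subset` parameter restricts the comparison to chosen columns, which flags more rows:

```python
import pandas as pd

toy = pd.DataFrame(
    {"order_id": [1, 2, 2, 3], "item": ["hat", "shoe", "shoe", "hat"]}
)

# Full-row duplicates: only the repeated (2, "shoe") row is flagged.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

# Duplicates judged on one column only: every repeated item is flagged.
n_item_dupes = int(toy.duplicated(subset=["item"]).sum())
print(n_dupes, len(deduped), n_item_dupes)
```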

### Data Type Conversion
Ensuring that each column has the correct data type is crucial for accurate analysis. Each column in a DataFrame has a single data type (dtype). When a column mixes values that do not fit one numeric type, such as numbers and text, pandas stores it with the generic `object` dtype, and numeric operations on it fail or behave unexpectedly. For example, dates should be converted to datetime objects, numerical data should be in appropriate numerical formats, and categorical data should be stored as category types where applicable.

Let's consider the example of the 'Review Rating' column. This column should be of float type, but because some rows contain text (as shown below), its dtype is currently `object`.

In [None]:
shop_behav['Review Rating'][55:60]

1. We will try to convert the column to float.
    ```python
    try:
        shop_behav['Review Rating'] = shop_behav['Review Rating'].astype('float')
    except ValueError as e:
        print(f"Error encountered: {e}")
    ```
    - This block attempts to directly convert the 'Review Rating' column to float.
    - A `ValueError` is encountered because some values contain the string ' stars', which cannot be converted to float.
    - The error message is printed to identify the issue.

In [None]:
# Attempt to directly convert 'Review Rating' to float
try:
    shop_behav['Review Rating'] = shop_behav['Review Rating'].astype('float')
except ValueError as e:
    print(f"Error encountered: {e}")

2. Now we'll clean the column's formatting before converting it to float.
    ```python
    shop_behav['Review Rating'] = shop_behav['Review Rating'].str.replace(' stars', '', regex=False).astype('float')
    ```
    - **shop_behav['Review Rating'].str.replace(' stars', '', regex=False)**: Removes the literal ' stars' text from each value in the 'Review Rating' column. This is used instead of `str.rstrip(' stars')` because `rstrip` treats its argument as a set of characters to strip, not as a suffix.
    - **astype('float')**: Converts the cleaned string values to float.
    - This ensures the 'Review Rating' column has the correct numerical data type.

3. **Displaying Data Types**:
    ```python
    shop_behav.dtypes
    ```
    - Displays the data types of all columns in the `shop_behav` DataFrame to verify the conversion.

In [None]:
# Fix the 'Review Rating' column by removing the ' stars' text and converting to float
shop_behav['Review Rating'] = shop_behav['Review Rating'].str.replace(' stars', '', regex=False).astype('float')

# Display data types of the columns
shop_behav.dtypes

Again checking the same rows of the column, we can see that the data type is now `float64`:

In [None]:
shop_behav['Review Rating'][55:60]
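
The other conversions mentioned in this section follow a similar pattern. Here is a minimal sketch on made-up data (all column names and values are illustrative). It uses `pd.to_numeric(..., errors='coerce')`, which turns unparseable values into `NaN` instead of raising, alongside datetime and category conversions:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "rating": ["4.5", "3.0 stars", "bad"],
        "purchased": ["2024-01-05", "2024-02-10", "2024-03-15"],
        "size": ["M", "L", "M"],
    }
)

# Strip the unit text, then coerce anything still unparseable to NaN.
cleaned = toy["rating"].str.replace(" stars", "", regex=False)
toy["rating"] = pd.to_numeric(cleaned, errors="coerce")

toy["purchased"] = pd.to_datetime(toy["purchased"])  # text -> datetime
toy["size"] = toy["size"].astype("category")         # text -> category

print(toy.dtypes)
```

Coercing hides bad values as `NaN`, so it is worth counting the missing values afterwards to see how many entries could not be parsed.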

## 2. Data Wrangling

Data wrangling, also known as data munging, involves transforming and mapping data from its raw form into another format to make it more appropriate and valuable for analysis. This process includes merging datasets, reshaping data, and creating new variables.

### Merging CSV Files
When working with large datasets, data may be split across multiple files. Merging these files into a single dataset is often necessary. This involves reading each file and concatenating them into one dataframe.

We first load the NOAA dataset and list the files it has.

In [None]:
# List and load CSV files for the dataset
folder_path = '../Datasets/NOAA_Weather'
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

csv_files

We can see that it has three CSV files, each containing weather recordings from a different station. If we want to analyze all three stations together, it is more convenient to concatenate them into a single DataFrame first and then perform the required tasks.

The code below concatenates multiple CSV files into a single DataFrame.

- **pd.concat([...])**: Concatenates the list of DataFrames into a single DataFrame.
- **[pd.read_csv(os.path.join(folder_path, file)) for file in csv_files]**: This list comprehension reads each CSV file in the `csv_files` list and returns a list of DataFrames.
    - **os.path.join(folder_path, file)**: Constructs the full file path for each CSV file.
    - **pd.read_csv(...)**: Reads the CSV file into a DataFrame.
- **ignore_index=True**: Ensures that the resulting DataFrame has a new, continuous index.

The result is a single DataFrame, `noaa`, containing the data from all the CSV files.

In [None]:
# Load and concatenate all CSV files
noaa = pd.concat([pd.read_csv(os.path.join(folder_path, file)) for file in csv_files], ignore_index=True)
noaa.head()

To verify that `noaa` indeed has all three stations:

In [None]:
noaa['STATION'].unique()
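
The effect of `ignore_index=True` is easiest to see on two tiny frames. In this sketch the frames stand in for per-station CSV files, and the `TEMP` column is made up for illustration:

```python
import pandas as pd

# Two toy frames standing in for per-station CSV files.
station_a = pd.DataFrame({"STATION": ["A", "A"], "TEMP": [50, 52]})
station_b = pd.DataFrame({"STATION": ["B", "B"], "TEMP": [61, 63]})

kept_index = pd.concat([station_a, station_b])
new_index = pd.concat([station_a, station_b], ignore_index=True)

print(kept_index.index.tolist())  # original labels repeat
print(new_index.index.tolist())   # fresh, continuous labels
```

Without `ignore_index=True`, row labels repeat across the source files, so a label-based lookup such as `.loc[0]` would return one row per file.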

### Creating New Columns
Creating new columns from existing data can provide additional insights or make data analysis easier. This can involve operations like arithmetic transformations, conditional logic, or feature engineering.

This code creates a new column 'COORDINATES' in the `noaa` DataFrame by concatenating the 'LATITUDE' and 'LONGITUDE' columns as strings.

- **noaa['LATITUDE'].astype('str')**: Converts the 'LATITUDE' column to strings.
- **noaa['LONGITUDE'].astype('str')**: Converts the 'LONGITUDE' column to strings.
- **noaa['LATITUDE'].astype('str') + ',' + noaa['LONGITUDE'].astype('str')**: Concatenates the latitude and longitude values with a comma in between to form coordinate strings.
- **noaa['COORDINATES']**: Assigns the resulting coordinate strings to a new column 'COORDINATES' in the DataFrame.

In [None]:
# Create a new column 'COORDINATES' in the form "LATITUDE,LONGITUDE"
noaa['COORDINATES'] = noaa['LATITUDE'].astype('str') + ',' + noaa['LONGITUDE'].astype('str')
noaa.head()
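
String concatenation is only one way to derive a column. The sketch below illustrates the arithmetic and conditional cases mentioned earlier, using a made-up `TEMP_F` column that is not taken from the NOAA files:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"TEMP_F": [32.0, 50.0, 95.0]})

# Arithmetic transformation: Fahrenheit to Celsius.
toy["TEMP_C"] = (toy["TEMP_F"] - 32) * 5 / 9

# Conditional logic: label rows at or below freezing.
toy["FREEZING"] = np.where(toy["TEMP_C"] <= 0, "yes", "no")
print(toy)
```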

## 3. Data Analysis

In this section, we will perform various data analysis tasks on the Yelp reviews dataset. This includes descriptive statistics, correlation analysis, grouping and aggregation, and trend analysis.

Let's first load the dataset.

In [None]:
file_path = '../Datasets/Yelp_Reviews/reviews.csv'
reviews = pd.read_csv(file_path)
reviews.head()

### Distribution of Ratings

We analyze the distribution of ratings in the Yelp reviews dataset. Understanding the distribution of ratings can provide insights into customer satisfaction and help identify patterns or trends in the feedback.

- **rating_distribution = reviews['stars'].value_counts().sort_index()**:
    - **reviews['stars']**: Select the 'stars' column from the DataFrame `reviews`, which contains the ratings given in the reviews.
    - **value_counts()**: Counts the occurrence of each unique rating value, giving us the number of reviews for each rating.
    - **sort_index()**: Sorts the counts by the rating values (index) in ascending order.

- **rating_distribution**:
    - This variable now holds a Series with the count of reviews for each rating, sorted by the rating values. It provides a clear view of how many reviews were given for each rating level (e.g., 1 star, 2 stars, etc.).

In [None]:
# Distribution of Ratings
rating_distribution = reviews['stars'].value_counts().sort_index()
rating_distribution
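
Raw counts are hard to compare across datasets of different sizes. As a small sketch on made-up ratings, passing `normalize=True` to `value_counts()` returns each rating's share of the total instead:

```python
import pandas as pd

stars = pd.Series([5, 4, 5, 1, 5, 4], name="stars")

counts = stars.value_counts().sort_index()
shares = stars.value_counts(normalize=True).sort_index()
print(counts)
print(shares.round(2))
```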

### Useful Reviews

This code provides a statistical summary of the 'useful' ratings in the Yelp reviews dataset. The 'useful' ratings indicate how many users found a review helpful. Analyzing this data helps understand the general usefulness of reviews from the perspective of other users.

- **useful_distribution = reviews['useful'].describe()**:
    - **reviews['useful']**: Selects the 'useful' column from the DataFrame `reviews`, which contains the count of how many users marked each review as useful.
    - **describe()**: Generates a summary of statistics for the 'useful' column, including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values.

- **useful_distribution**:
    - This variable now holds a Series with the statistical summary of the 'useful' ratings. It provides insights into the distribution and central tendencies of how useful users find the reviews.

In [None]:
# Useful Reviews
useful_distribution = reviews['useful'].describe()
useful_distribution

### Review Length vs Useful Votes

Here we analyze the relationship between the length of a review and the number of useful votes it receives. By examining this relationship, we can understand if longer reviews tend to be more useful to readers.

- **reviews['review_length'] = reviews['text'].apply(len)**:
    - **reviews['text']**: Selects the 'text' column from the DataFrame `reviews`, which contains the review texts.
    - **apply(len)**: Applies the `len` function to each review text, calculating the length of each review in terms of the number of characters.
    - **reviews['review_length']**: Creates a new column 'review_length' in the DataFrame `reviews` to store the length of each review.

- **review_length_vs_useful = reviews[['review_length', 'useful']]**:
    - Selects the 'review_length' and 'useful' columns from the DataFrame `reviews` and creates a new DataFrame `review_length_vs_useful` containing these two columns.

- **review_length_vs_useful.head()**:
    - Displays the first five rows of the `review_length_vs_useful` DataFrame. This provides a quick look at the data, showing the length of the reviews alongside the number of useful votes they received.

In [None]:
# Review Length vs Useful Votes
reviews['review_length'] = reviews['text'].apply(len)
review_length_vs_useful = reviews[['review_length', 'useful']]
review_length_vs_useful.head()
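
Looking at the first rows does not by itself show whether longer reviews attract more useful votes. One way to quantify the relationship, sketched here on made-up reviews, is the Pearson correlation between the two columns. The sketch uses `.str.len()`, which yields `NaN` for missing text, whereas `apply(len)` raises an error on missing values:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "text": ["ok", "great food", "long and detailed review text"],
        "useful": [0, 1, 4],
    }
)

toy["review_length"] = toy["text"].str.len()  # characters per review
r = toy["review_length"].corr(toy["useful"])  # Pearson by default
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none. Three rows are far too few to draw conclusions; the sketch only shows the mechanics.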

### Sentiment Analysis

We apply sentiment analysis to the review texts to classify them as Positive, Negative, or Neutral.

In [None]:
from textblob import TextBlob

# Function to classify sentiment
def classify_sentiment(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity > 0:
        return 'Positive'
    elif analysis.sentiment.polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

# Apply sentiment analysis to the text column
reviews['sentiment'] = reviews['text'].apply(classify_sentiment)
sentiment_distribution = reviews['sentiment'].value_counts()
sentiment_distribution
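
The cell above labels any polarity other than exactly 0 as Positive or Negative. One possible variation, shown as a sketch and not as the course's method, adds a small neutral band around zero. The helper name `label_from_polarity` and the `margin` value are assumptions made for illustration. It takes a polarity score in the range [-1, 1] directly, so it runs without TextBlob:

```python
def label_from_polarity(polarity, margin=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label.

    `margin` is an illustrative threshold, not a value from the lab.
    """
    if polarity > margin:
        return "Positive"
    if polarity < -margin:
        return "Negative"
    return "Neutral"


labels = [label_from_polarity(p) for p in (0.6, -0.3, 0.02)]
print(labels)
```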

## 4. Data Visualization

In this section, we will create various plots to visualize the data and the results of our analysis.

### Bar Chart for Distribution of Ratings

This plot is a bar chart that visualizes the distribution of star ratings in the Yelp reviews dataset. It shows how frequently each star rating (1 to 5 stars) is given, providing insights into overall customer satisfaction.

- **import matplotlib.pyplot as plt**:
    - Imports the `matplotlib.pyplot` module, which is used for creating visualizations.

- **plt.figure(figsize=(10, 6))**:
    - Creates a new figure with a specified size of 10 inches in width and 6 inches in height.

- **reviews['stars'].value_counts().sort_index().plot(kind='bar')**:
    - **reviews['stars']**: Selects the 'stars' column from the DataFrame `reviews`.
    - **value_counts()**: Counts the occurrence of each unique rating value.
    - **sort_index()**: Sorts the counts by the rating values in ascending order.
    - **plot(kind='bar')**: Creates a bar plot of the sorted rating counts.

- **plt.xlabel('Star Ratings')**:
    - Sets the label for the x-axis to 'Star Ratings'.

- **plt.ylabel('Frequency')**:
    - Sets the label for the y-axis to 'Frequency'.

- **plt.title('Distribution of Star Ratings')**:
    - Sets the title of the plot to 'Distribution of Star Ratings'.

- **plt.show()**:
    - Displays the bar chart.



In [None]:
import matplotlib.pyplot as plt

# Bar Chart for Distribution of Ratings
plt.figure(figsize=(10, 6))
reviews['stars'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Star Ratings')
plt.ylabel('Frequency')
plt.title('Distribution of Star Ratings')
plt.show()

### Scatter Plot for Review Length vs Useful Votes

This plot is a scatter plot that visualizes the relationship between the length of reviews and the number of useful votes they receive. Each point represents a review, with its position determined by the review's length and the number of useful votes. This helps in identifying any patterns or correlations between these two variables.

- **plt.scatter(reviews['review_length'], reviews['useful'], alpha=0.5)**:
    - **plt.scatter()**: Creates a scatter plot.
    - **reviews['review_length']**: Specifies the x-axis data, which is the length of the reviews.
    - **reviews['useful']**: Specifies the y-axis data, which is the number of useful votes.
    - **alpha=0.5**: Sets the transparency level of the points to 0.5, making it easier to see overlapping points.

- **plt.xlabel('Review Length')**:
    - Sets the label for the x-axis to 'Review Length'.

- **plt.ylabel('Number of Useful Votes')**:
    - Sets the label for the y-axis to 'Number of Useful Votes'.

- **plt.title('Review Length vs Useful Votes')**:
    - Sets the title of the plot to 'Review Length vs Useful Votes'.

- **plt.show()**:
    - Displays the scatter plot.


In [None]:
# Scatter Plot for Review Length vs Useful Votes
plt.figure(figsize=(10, 6))
plt.scatter(reviews['review_length'], reviews['useful'], alpha=0.5)
plt.xlabel('Review Length')
plt.ylabel('Number of Useful Votes')
plt.title('Review Length vs Useful Votes')
plt.show()

### Line Chart for Average Ratings Over Time

This plot is a line chart that visualizes the average star rating of reviews over time. It shows how the average rating has changed monthly, allowing for the identification of trends and patterns in customer satisfaction over the observed period.

- **monthly_avg_rating.plot()**:
    - Plots the `monthly_avg_rating` Series from the Data Analysis section, which contains the average star rating for each month, as a line chart.

- **plt.xlabel('Date')**:
    - Sets the label for the x-axis to 'Date'.

- **plt.ylabel('Average Star Rating')**:
    - Sets the label for the y-axis to 'Average Star Rating'.

- **plt.title('Average Star Rating Over Time')**:
    - Sets the title of the plot to 'Average Star Rating Over Time'.

- **plt.show()**:
    - Displays the line chart.
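
The plotting cell below expects a `monthly_avg_rating` Series, which this recap does not define. Here is a minimal sketch of one way to build it. It assumes the reviews have a `date` column holding timestamps (that column is not shown in this recap) and uses made-up rows:

```python
import pandas as pd

# Made-up rows; the real data would come from the reviews DataFrame.
toy = pd.DataFrame(
    {
        "date": ["2020-01-03", "2020-01-20", "2020-03-11"],  # assumed column
        "stars": [5, 3, 4],
    }
)
toy["date"] = pd.to_datetime(toy["date"])

# Group by calendar month and average the ratings.
monthly = toy.groupby(toy["date"].dt.to_period("M"))["stars"].mean()

# Reindex over every month in the range so months without reviews show as NaN.
all_months = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
monthly_avg_rating = monthly.reindex(all_months)
print(monthly_avg_rating)
```

Months with no reviews hold `NaN` in this Series, and a line plot leaves a gap at those positions.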


In [None]:
# Line Chart for Average Ratings Over Time
plt.figure(figsize=(10, 6))
monthly_avg_rating.plot()
plt.xlabel('Date')
plt.ylabel('Average Star Rating')
plt.title('Average Star Rating Over Time')
plt.show()

The breaks in the line indicate that no data was available for those months.

### Pie Chart for Sentiment Distribution

This plot is a pie chart that visualizes the distribution of sentiment classifications (Positive, Negative, Neutral) in the Yelp reviews dataset. It shows the proportion of each sentiment category, providing insights into the overall sentiment of the reviews.

- **sentiment_counts = reviews['sentiment'].value_counts()**:
    - Counts the occurrences of each sentiment category in the 'sentiment' column of the DataFrame `reviews`, which we created in the Data Analysis section.
    - **sentiment_counts**: Stores the counts of each sentiment.

- **plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=140)**:
    - **plt.pie()**: Creates a pie chart.
    - **sentiment_counts**: Provides the data for the pie chart (the counts of each sentiment).
    - **labels=sentiment_counts.index**: Labels each slice of the pie chart with the sentiment categories.
    - **autopct='%1.1f%%'**: Displays the percentage of each slice with one decimal place.
    - **startangle=140**: Rotates the start of the pie chart to 140 degrees for better visual presentation.

- **plt.title('Sentiment Distribution of Reviews')**:
    - Sets the title of the plot to 'Sentiment Distribution of Reviews'.

- **plt.show()**:
    - Displays the pie chart.

In [None]:
# Pie Chart for Sentiment Distribution
# Count the occurrences of each sentiment
sentiment_counts = reviews['sentiment'].value_counts()

# Plot the pie chart
plt.figure(figsize=(8, 8))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Sentiment Distribution of Reviews')
plt.show()

### Heat Map for Numeric Columns

This plot is a heat map that visualizes the correlation between numeric columns in the Yelp reviews dataset. The heat map shows the strength and direction of the relationships between pairs of variables, helping to identify patterns and potential dependencies.

- **import seaborn as sns**:
    - Imports the `seaborn` library, which is used for creating advanced visualizations.

- **numeric_cols = ['stars', 'useful', 'funny', 'cool']**:
    - Defines a list of numeric columns that will be included in the heat map.

- **corr_matrix = reviews[numeric_cols].corr()**:
    - Computes the correlation matrix for the selected numeric columns.
    - **reviews[numeric_cols]**: Selects the specified numeric columns from the DataFrame `reviews`.
    - **corr()**: Calculates the pairwise correlation coefficients between the columns.

- **plt.figure(figsize=(10, 8))**:
    - Creates a new figure with a specified size of 10 inches in width and 8 inches in height.

- **sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')**:
    - **sns.heatmap()**: Plots the heat map.
    - **corr_matrix**: Provides the data for the heat map (the correlation matrix).
    - **annot=True**: Displays the correlation coefficients on the heat map.
    - **cmap='coolwarm'**: Uses the 'coolwarm' colormap for the heat map.
    - **fmt='.2f'**: Formats the correlation coefficients to two decimal places.

- **plt.title('Correlation Heatmap of Numeric Columns')**:
    - Sets the title of the plot to 'Correlation Heatmap of Numeric Columns'.

- **plt.show()**:
    - Displays the heat map.


In [None]:
import seaborn as sns

# Heat Map for Numeric Columns
# Select numeric columns for the heat map
numeric_cols = ['stars', 'useful', 'funny', 'cool']

# Compute the correlation matrix
corr_matrix = reviews[numeric_cols].corr()

# Plot the heat map
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numeric Columns')
plt.show()

## 5. Reflection Exercise

Based on your inputs from the previous two exercises, here is your personalized prompt that you can use with a Generative AI model of your choice to obtain a comprehensive data analysis guide tailored to your needs. This guide will help you navigate through your data analysis tasks with ease and precision.

In [None]:
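# Read the personalized prompt generated from the previous exercises
with open('../Prompts/Prompt.txt', 'r') as file:
    prompt = file.read()

# Print the prompt line by line so it displays with its original line breaks
for line in prompt.splitlines():
    print(line)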