---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# AI-powered Data Analysis

In this notebook, you'll follow a guided walkthrough of how GenAI can be leveraged in the data analysis process. By the end, you'll have an open sandbox where you can experiment in this Python environment with your own AI-generated code. Press `Shift` + `Enter` to easily move forward in the notebook!

### Learning to Think Step-by-Step in Data Analysis

The main thing I want you to learn is to tackle data analysis in a, well, analytical way, by breaking it down into clear steps:

1. **Contextualize:** Identify your context. What data are you working with? What's the general topic of what you're trying to do? This initial context is crucial for framing your analysis.

2. **Set a Goal:** What is your general goal? What do you want to achieve with this data analysis? Having a clear objective is key.

3. **Strategize:** With that goal in mind, what's a good strategy for getting there? What tools do you want to use? What approaches or methods? What order do you need to do things in? Planning your approach ensures better outcomes.

4. **Implement:** A carefully articulated strategy should help you in this main process that implements your tools and methods. Many people skip to this step without doing the work beforehand!

5. **Interact:** This is the iterative process of data analysis. You will need to interpret your results, troubleshoot if things aren't going your way, or iterate to get a more refined version of what you want.

6. **Document:** Once you've got what you want, make sure you record—in detail—what you did to get there! This is important both for sharing data and for revisiting your analyses later yourself.

### The Role of AI in Data Analysis

How you interact with AI in the context of data analysis will depend on the AI tools you're using, the data, and your experience, among other factors. A useful way to direct your interactions is to consider different roles that AI can take on in this process. These roles roughly correspond to a mode of interaction and your comfort and experience level with the specific data analysis process you want to engage in.

| Role       | Mode       | Level       |
|:----------:|:----------:|:-----------:|
| Tutor      | Learning   | Novice      |
| Co-pilot   | Exploring  | Intermediate|
| Intern     | Producing  | Expert      |

Take these levels with a grain of salt because you might be experienced or advanced in some areas of data analysis but want to engage with AI as a tutor to learn a new analysis or use a new package.

In the walkthrough, you'll see tabs for these different roles/modes of engaging with AI at each step.

We will be working with the datasets that we described in Lab 1. 

# Step 1: Contextualize

Let's assume that your starting point is a dataset you have received and want to explore: `../Datasets/NOAA_Weather/udskoe-russia.csv`

## Set up your GenAI tool

It's important to set up whatever your tool is with the context of your dataset and what you'll be engaging in.

For the purposes of this lab, I will be using ChatGPT with the GPT-4 model (I prefer it over the GPT-4o model) with code interpreter. And the initializing prompts I'm working with for each mode will be:

::::{tab-set}

:::{tab-item} Tutor (Learning Mode)
:sync: tab-tutor-contextualize
```
Act as a Data Analysis Tutor to provide a strong educational foundation for my data analysis project.

Responsibilities:

1. Educate: Explain each step and decision clearly.
2. Guide: Use the provided CSV file for illustrations and answering questions.
3. Respond Patiently: Answer queries with clear, instructive insights, waiting for my cues.
4. Review: Discuss errors or misconceptions post-evaluation.
5. Confirm: Paraphrase my instructions to ensure alignment.

Working Environment: Jupyter Notebook.

Paraphrase my instructions to verify your comprehension.
```
:::

:::{tab-item} Co-pilot (Exploring Mode)
:sync: tab-copilot-contextualize
```
Serve as a Data Analysis Copilot to navigate my data analysis project.

Responsibilities:

1. Collaborate: Understand data nuances influencing our analysis.
2. Integrate: Use the provided CSV file in our workspace for discussions.
3. Dialogue: Engage in a two-way interaction, pausing for my input.
4. Review: Jointly assess results, considering improvements.
5. Confirm: Echo my directives to ensure synchronization.

Working Environment: Jupyter Notebook.

Echo my objectives back to confirm alignment.

```
:::

:::{tab-item} Intern (Producing Mode)
:sync: tab-intern-contextualize
```
Function as a Data Analysis Intern, executing tasks I delegate.

Responsibilities:

1. Query: Request details influencing task outcomes.
2. Execute: Load and apply the provided CSV file as instructed.
3. Conform: Follow instructions strictly, without introducing new steps.
4. Feedback: Confirm if steps align with objectives post-execution.
5. Repeat: Echo my instructions to demonstrate adherence.

Working Environment: Jupyter Notebook.

Retell my commands to confirm accurate following.
```
:::

And, yes, these were generated and iterated with AI!

::::

:::{attention} Prompts are not magic or universal
:class: dropdown

These suggested prompts are a _starting point_, but you'll have to actually put some thought into what makes sense for you to include in your prompt:
- How big is your context window? (i.e. how much text can you put in there)
- Does your tool have a tendency to give verbose (long, wordy) replies?
- Can you access other settings, like the system prompt?

**There are no magic words that will reliably get you a perfect result from an AI chatbot**. Even when you do find something close to a "perfect prompt", it may stop working after the model is updated or some other aspect of the tool's design is changed.

Any of these will affect the best way to get the most use out of your AI tool. This isn't even covering the fact that many IDEs are now incorporating GenAI into their products, meaning you can often talk to GPT-4, Gemini, and other AI models directly from your notebook.

Instead of focusing on optimizing for the current capabilities of the AI tools around you, focus on understanding the _way_ you can delegate and automate aspects of data analysis given the components of that process--i.e. the steps in this exercise!
:::

Depending on your tool of choice, you may not be able to refer to a CSV file or have it run code with that CSV file. Below are some suggested initializing prompts from our readings if you want ideas of what direction to go in for adapting your prompt.

::::{note} Adapting context depending on tool

Below are some example prompts from the Step 0: Context and Setup reading if you need a refresher of what to consider for tool-specific prompting.

:::{seealso} Basic (no code interpreter or file upload)
:class: dropdown

Start the conversation off by specifying your situation and what you’ll be trying to do. I like to prompt with a role I want the GenAI bot to take on.

An example of what that initial prompt might look like if you can't upload your data:

```
Role: Act as a Data Analysis Copilot, providing advice and educational explanations on how to approach my data analysis project.

Responsibilities:

Inquire and Clarify: Ask about details that can impact your advice (e.g., data types, dataframe or variable attributes).

Contextual Understanding: Use the provided pasted data as context for answering my questions.

Here is the data:
<data>
{paste in some data here,
depending on context window,
it may only be a few lines}
</data>

Direct Responses: Answer my questions directly and do not proceed with additional steps until I explicitly ask.

Concise and Educational Explanations: Provide concise explanations, discuss the general consensus on different options, and give clear recommendations on how I should proceed, explaining the reasoning behind your advice.

Verification Guidance: Provide instructions on how I can verify that the code works and achieves the intended goal.

Working Environment: I am using a Jupyter notebook for my work.

Repeat back the instructions I have given to ensure understanding.
```
The last sentence is mainly so that you can separate your first actual query from this role setting stage, and it should give you an idea of how the model is interpreting your instructions. 

The details of this are beyond the scope of this short course, but you can think of it this way: your input determines your output, and priming the conversation by giving context will influence the output.

Feel free to copy this template and adjust as needed.
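
If you can't upload your file, something like the following could produce a compact text version of your data to paste between the `<data>` tags. This is only a minimal sketch: `make_prompt_snippet` is a hypothetical helper, and the stand-in DataFrame and its column names are made up for illustration.

```python
import pandas as pd


def make_prompt_snippet(df: pd.DataFrame, n_rows: int = 5) -> str:
    """Build a short text summary (dtypes plus the first rows as CSV) to paste into a prompt."""
    shown = df.head(n_rows)
    dtype_lines = "\n".join(f"{col}: {dtype}" for col, dtype in df.dtypes.items())
    sample_csv = shown.to_csv(index=False)
    return f"Columns and dtypes:\n{dtype_lines}\n\nFirst {len(shown)} rows:\n{sample_csv}"


# Stand-in data with made-up columns; in practice, use the DataFrame loaded from your CSV
demo_df = pd.DataFrame(
    {"DATE": ["2020-01-01", "2020-01-02", "2020-01-03"], "TAVG": [-25.1, None, -22.4]}
)

snippet = make_prompt_snippet(demo_df, n_rows=2)
print(snippet)
```

Keeping `n_rows` small helps the pasted data fit in a limited context window.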

:::

:::{seealso} File upload (no code interpreter)
:class: dropdown
Same as above, except that instead of pasting the data in, you can say you attached the file and refer to it as an attached file.


```
Role: Act as a Data Analysis Copilot providing advice and educational explanations on how to approach my data analysis project.

Responsibilities:

1. Inquire and Clarify: Ask about details that can impact your advice (e.g., data types, dataframe or variable attributes).
    
2. Contextual Understanding: Load and use the attached spreadsheet (CSV file) as context for answering my questions.
    
3. Direct Responses: Answer my questions directly and do not proceed with additional steps until I explicitly ask.
    
4. Concise and Educational Explanations: Provide concise explanations, discuss the general consensus on different options, and give clear recommendations on how I should proceed, explaining the reasoning behind your advice.
    
5. Verification Guidance: Provide instructions on how I can verify that the code works and achieves the intended goal.
    

Working Environment: I am using a Jupyter notebook for my work.

Repeat back the instructions I have given to ensure understanding.
```
:::

:::{seealso} Code interpreter
:class: dropdown
This is in some ways the easiest option because you can have it generate and run code for you to do all the work. However, you’ll still need to ask questions to make sure it has done the task correctly. Some of this can be alleviated by priming it to reflect on its answers at the beginning of the conversation with something like: “After running code, revisit my question, critically evaluate your approach, and verify if the output achieved the goal.”

Here’s what the complete first prompt could look like:

```
Role: Act as a Data Analysis Copilot providing advice and educational explanations on how to approach my data analysis project.

Responsibilities:

1. Inquire and Clarify: Ask about details that can impact your advice (e.g., data types, dataframe or variable attributes).
    
2. Contextual Understanding: Load and use the attached CSV file as context for answering my questions.
    
3. Direct Responses: Answer my questions directly and do not proceed with additional steps until I explicitly ask.
    
4. Critical Evaluation: After running code, revisit my question, critically evaluate your approach, and verify if the output achieved the goal.
    
5. Instruction Reiteration: Repeat back the instructions I have given to ensure understanding.
    

Working Environment: I am using a Jupyter notebook for my work.

Repeat back the instructions I have given to ensure understanding.
```
:::

::::

## Understand what you're working with


This is a stage in data analysis where a user's level really makes a difference in how useful AI can be. All groups can leverage GenAI tools for some combination of information retrieval and soundboarding. Even experts can benefit: they may know the subject matter or the types of analysis done in a particular field without knowing the specific dataset, and at the very least AI can help them organize their thoughts.

::::{tab-set}

:::{tab-item} Tutor (Learning Mode)
:sync: tab-tutor-contextualize

Depending on how unfamiliar you are with the general subject matter and the dataset, you may want to start off very broadly by asking which fields work with and analyze this kind of data, what it is about, what you can learn from it, and so on. You can also ask how it's formatted and what that means.

Example prompt:
> I don't know anything about the type of data I'm working with here. Can you tell me more about the subject matter and what is represented in the file?

:::

:::{tab-item} Co-pilot (Exploring Mode)
:sync: tab-copilot-contextualize

You may know a little bit about the data. Maybe you've worked with similar things before and you want to think more creatively. You can use AI to engage in some soundboarding about what is typically done vs. what is cutting edge or what could be an innovative approach.

Example prompt: 
> What kinds of analyses are usually done with this data? And what could be an interesting novel way to look at it?

:::

:::{tab-item} Intern (Producing Mode)
:sync: tab-intern-contextualize

If you're familiar with the type, format, and field of the data, as well as the kinds of analyses that are usually done, this could be a good time to state, for the AI 
:::

::::

::::{tab-set}

:::{tab-item} Tutor (Learning Mode)
:sync: tab-tutor-contextualize
- AI suggests key readings on your data's subject area.
- AI outlines common questions to consider for understanding the context.
:::

:::{tab-item} Co-pilot (Exploring Mode)
:sync: tab-copilot-contextualize
- AI helps refine search queries for literature and data sources.
- AI analyzes metadata to give an overview of the data structure.
:::

:::{tab-item} Intern (Producing Mode)
:sync: tab-intern-contextualize
- AI drafts a context summary based on inputs about the problem domain.
- AI pre-processes data for preliminary overview (e.g., missing values, data types).
:::

::::
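
The kind of preliminary overview mentioned for the Intern role (missing values, data types) can be sketched in a few lines of pandas. Treat this as an illustration rather than a fixed recipe: `quick_overview` is a hypothetical helper, and the tiny DataFrame below is made up so the example runs on its own.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, number of missing values, number of distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),  # missing values are not counted as a distinct value
        }
    )


# Stand-in data with made-up values; swap in your own DataFrame
demo_df = pd.DataFrame(
    {
        "station": ["A", "A", "B", None],
        "temp_c": [1.5, None, 3.0, 3.0],
    }
)

overview = quick_overview(demo_df)
print(f"{len(demo_df)} rows, {demo_df.shape[1]} columns")
print(overview)
```

A table like this is also handy to paste into a chat when your tool can't read the file itself.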


In this notebook, you'll encounter specific indicators to prompt your actions:

::::{note} **Use GenAI** 
This cue invites you to use GenAI for problem-solving. You can try your own prompt or use the provided example.
:::{dropdown} Example Prompt
:::{code-cell}
This is where you’d find an example prompt you could paste into your GenAI tool of choice.
:::

:::{warning} **Show AI-Solution**
:class: dropdown
If, for any reason, you cannot use a GenAI chatbot, you have the option to refer to a solution generated with our example prompt. However, to maximize learning, we encourage you to generate and apply your own solutions, as the course aims to develop your independent use of AI tools.
:::

:::{tip} **AI-Sandbox**
There will be blocks like this, showing you a code block where you can paste and execute your own AI-generated code. These blocks usually follow guided instructions on the prompts you want to use and the problem you are solving.
:::

```python
# They will be followed by an editable code block like this. You can enter your code from the next line!
```



You’ve been given a CSV file and are tasked with cleaning the data. While there may be some glaring issues that you can see if you open the file in a spreadsheet editor (like Excel), that’s not always a feasible approach. Sometimes there’s too much data to go through everything manually and sometimes the problems are subtle and can easily be missed by a human observer (human error is often the reason the issue was there in the first place).


Let's start by importing the required libraries, loading the CSV file for the `shopping_behavior` dataset in the `Kaggle_Ecommerce` folder, and examining the data to identify errors.

First we load the necessary libraries (i.e. the prewritten code from packages). 

```python
# Importing necessary libraries

from pathlib import Path  # Module for handling file paths
import pandas as pd  # Library for data manipulation and analysis
import numpy as np  # Library for numerical computations
```

These have already been installed here, but if you were to run these locally without having them, you’d get an error and need to install them first.

Then we can read in our CSV file as a DataFrame so we can use it in our Python environment.

```python
# Define the relative path to the dataset CSV file
file_path = Path("..") / "Datasets" / "Kaggle_Ecommerce" / "shopping_behavior.csv"
# Read the CSV file into a pandas DataFrame
shop_behav = pd.read_csv(file_path)
# Display the first 5 rows of the DataFrame
shop_behav.head()
```

| Unnamed: 0 | Customer ID | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | Review Rating | Subscription Status | Shipping Type | Discount Applied | Promo Code Used | Previous Purchases | Payment Method | Frequency of Purchases |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Express | Yes | Yes | 14 | Venmo | Fortnightly |
| 1 | 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Express | Yes | Yes | 2 | Cash | Fortnightly |
| 2 | 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly |
| 3 | 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | Next Day Air | Yes | Yes | 49 | PayPal | Weekly |
| 4 | 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Free Shipping | Yes | Yes | 31 | PayPal | Annually |
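
If `Unnamed: 0` shows up as an actual column in your DataFrame, it usually means the CSV was saved together with its row index, so the column is a leftover rather than real data. One way to handle that, sketched here on a small in-memory CSV (a trimmed-down stand-in, not the actual file), is to tell pandas to treat the first column as the index when reading.

```python
import io

import pandas as pd

# Two rows in the same layout as the data above, trimmed to a few columns for brevity
csv_text = """,Customer ID,Age,Gender
0,1,55,Male
1,2,19,Male
"""

# Default read: the unnamed first column is loaded as a data column called "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses the first column as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```

Whether you drop the column or read it as the index is a judgment call; the point is to notice it before it sneaks into an analysis.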


## 1. Understanding the issue

### A General Checklist for Data Wrangling

It's often useful to go through some basic quality control by going through a checklist of common issues.

:::{note} **Use GenAI** to get a data wrangling checklist
Use GenAI as an educational tool to learn about common data cleaning issues you might face so that you can make a checklist of possible issues.

Think of a prompt that might give you a good overview of common data cleaning issues.

:::{dropdown} Example Prompt
:::{code-cell}
What are common data cleaning issues I should look out for when preparing a dataset for analysis?

:::

:::{warning} Show AI-Solution
:class: dropdown

Data cleaning is crucial to ensure the integrity of your analysis. Here are some common issues to watch out for:

1. **Missing Values**: Incomplete datasets can lead to biased or incorrect results. You’ll need to decide whether to impute missing values, remove the affected rows, or even ignore them, depending on the situation and the proportion of missing data.
   
2. **Duplicate Data**: Redundant entries can skew your analysis, leading to overestimated significance or effects. Identifying and removing duplicates is essential, especially in datasets where entries should be unique.
   
3. **Inconsistent Formats**: Data collected from different sources or methods might have varying formats. For instance, dates might be recorded as DD-MM-YYYY in one part and MM-DD-YYYY in another. Standardizing these into a single format is necessary for accurate analysis.
   
4. **Outliers and Anomalies**: Extreme values can significantly affect the results of your analysis. Determining whether outliers are due to data entry errors, measurement errors, or genuine extreme variations is important. Sometimes, outliers are insightful; other times, they are just noise.
   
5. **Incorrect Data Types**: Numerical values recorded as strings, categorical data treated as continuous, etc., can mess up data processing. Converting data to appropriate types is crucial.
   
6. **Typos and Spelling Errors**: Errors in categorical data can create artificially inflated categories. For example, 'Brown' vs. 'Borwn' in color categories. These need to be corrected through spelling checks or manual review.
   
7. **Scale and Unit Inconsistencies**: Ensure all data points are measured on the same scale and units (e.g., kilograms vs. pounds, or meters vs. feet) to maintain consistency across the dataset.
   
8. **Encoding Issues**: Problems with character encoding can lead to strange characters appearing in your dataset, especially if the data comes from multiple international sources.
   
9. **Irrelevant or Redundant Features**: Not all features in your dataset may be relevant to your analysis. Identifying and removing irrelevant or redundant features can simplify your model and improve performance.

Addressing these issues thoroughly can enhance the quality of your analysis and help you draw more accurate conclusions from your data.
:::
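
Several items on a checklist like this can be turned into quick first-pass checks. The sketch below covers missing values, duplicates, data types, and category typos on a made-up DataFrame with deliberately planted problems; the column names and values are illustrative only, and a real dataset will need checks tailored to it.

```python
import pandas as pd

# Stand-in data with planted problems; all values are made up
demo_df = pd.DataFrame(
    {
        "Color": ["Brown", "Borwn", "Gray", "Gray", None],
        "Amount": ["53", "64", "73", "73", "90"],  # numbers stored as strings
    }
)

# Missing values: count per column
missing_counts = demo_df.isna().sum()

# Duplicate data: rows identical to an earlier row in every column
n_duplicates = int(demo_df.duplicated().sum())

# Incorrect data types: a numeric column stored as text will not show a numeric dtype
column_types = demo_df.dtypes

# Typos and spelling errors: rare spellings stand out in the value counts
color_counts = demo_df["Color"].value_counts()

print(missing_counts, n_duplicates, column_types, color_counts, sep="\n\n")
```

None of these checks fixes anything; they only tell you where to look next.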


### Tackling a specific issue: missing values

Now that you have your list of general wrangling/cleaning issues to look out for, it’s time to address these issues. Not all of them will be relevant, but it’s a good idea to check for as many of them as you can. Since we made the dataset, we know what’s wrong with it so you can have the advantage of knowing which issues we’ll deal with!

Let’s walk through solving our first problem together and focus on missing values.

:::{note} **Use GenAI** to understand how missing values are usually handled
Use GenAI to figure out how to find and handle missing values. You can do this in multiple steps (i.e. ask how to identify missing data first, then how to handle it), but since this is a pretty straightforward issue, you can just ask both in one go.
:::{dropdown} Example Prompt
:::{code-cell}
How can I identify and handle missing data in a Python DataFrame?
:::

:::{warning} Show AI-Solution
:class: dropdown
**Identifying and Handling Missing Data in a DataFrame**

**Identifying Missing Data**

To identify missing data in a DataFrame, we can use the `isnull()` or `isna()` method, which will return a DataFrame of the same shape, but with boolean values indicating the presence of missing data.

```python
import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': [1, 2, 3, None]
}
df = pd.DataFrame(data)

# Identifying missing data
missing_data = df.isnull()
print(missing_data)
```

To get a summary of the missing data, we can use the `sum()` method along with `isnull()`.

```python
# Summary of missing data
missing_summary = df.isnull().sum()
print(missing_summary)
```

**Handling Missing Data**

There are several strategies to handle missing data:

**Removing Missing Data:**

- **Remove rows with any missing values:**

```python
df_dropped_rows = df.dropna()
print(df_dropped_rows)
```

- **Remove columns with any missing values:**

```python
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)
```

**Filling Missing Data:**

- **Fill missing values with a specific value:**

```python
df_filled_value = df.fillna(0)
print(df_filled_value)
```

- **Fill missing values using forward fill (propagate last valid observation forward):**

```python
df_filled_forward = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas versions
print(df_filled_forward)
```

- **Fill missing values using backward fill (propagate next valid observation backward):**

```python
df_filled_backward = df.bfill()  # fillna(method='bfill') is deprecated in recent pandas versions
print(df_filled_backward)
```

- **Fill missing values with the mean of the column:**

```python
df_filled_mean = df.apply(lambda col: col.fillna(col.mean()), axis=0)
print(df_filled_mean)
```

**Choosing the Right Method**

The method you choose to handle missing data depends on the context of your analysis and the nature of the data. For example, if missing values are few and scattered, filling them with the mean might be a good choice. However, if a column has a large proportion of missing values, it might be better to drop that column.

Always consider the impact of your chosen method on the integrity and validity of your analysis.

:::

As you can see from the AI-output, you’ll usually get multiple suggestions on how you can solve the problem.

In this case, your options are to leave it as is (which may not have come up as a suggestion!), drop columns or rows with missing values, or to fill missing values somehow (there are multiple ways to do this, ranging in complexity). 
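
Before picking one of these options, it can help to measure what each would cost you. Here is a rough sketch on a made-up DataFrame (the column names echo the shopping data, but the values are invented) that reports the share of missing values per column and how much data would survive dropping rows versus dropping columns.

```python
import pandas as pd

# Stand-in data; values are made up
demo_df = pd.DataFrame(
    {
        "Age": [55, None, 50, 21],
        "Review Rating": [3.1, 3.1, None, 3.5],
        "Gender": ["Male", "Male", "Male", "Male"],
    }
)

# Share of missing values per column (0.0 means complete, 1.0 means empty)
missing_share = demo_df.isna().mean()

# What each "drop" option would leave behind
rows_kept = len(demo_df.dropna())            # rows without any missing value
cols_kept = demo_df.dropna(axis=1).shape[1]  # columns without any missing value

print(missing_share)
print(f"Dropping rows keeps {rows_kept} of {len(demo_df)} rows")
print(f"Dropping columns keeps {cols_kept} of {demo_df.shape[1]} columns")
```

Numbers like these are also useful context to include in your next prompt.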

Since AI is unlikely to question your request, the way you phrase your question can lead it astray. This is a great example of a case where you want to think *critically* about how you prompt it.

## 2. Developing a strategy

:::{note} **Use GenAI** to decide how to handle missing values
Use GenAI to help you make your decision about what to do. Think about how to prompt it so it doesn’t assume you know the right answer.
:::{dropdown} Example Prompt
:::{code-cell}
I've noticed missing values in my dataset and I'm unsure how to handle them. Should I leave them, remove the rows/columns, or fill in the gaps?

Can you provide the pros and cons of each option and what factors I should consider about my data and goals to make the best decision?

To illustrate the full approach, let’s just say that I’ve made the executive decision to remove missing values.

:::

:::{warning} Show AI-Solution
:class: dropdown
**Handling Missing Data: Removing Missing Values**

When dealing with missing data in a dataset, one common approach is to remove the rows or columns containing missing values. Here's a detailed look at the pros, cons, factors to consider, and an example workflow if you decide to remove missing values.

**Pros of Removing Missing Values**

- **Simplicity:** Removing missing values is straightforward and easy to implement.
- **Preserves Data Integrity:** Avoids making assumptions or imputations that could potentially skew analysis.
- **Compatibility:** Ensures compatibility with certain machine learning algorithms that do not handle missing data well.

**Cons of Removing Missing Values**

- **Loss of Data:** Reduces the size of the dataset, potentially leading to loss of valuable information.
- **Bias:** If missing data are not randomly distributed, removing them could bias the analysis.
- **Impact on Representativeness:** Removing too many rows could affect the representativeness of the dataset.

**Factors to Consider**

1. **Amount of Missing Data:**
   Evaluate how much data would be lost by removing missing values. If a significant portion of your data is missing, consider alternative methods like imputation.

2. **Data Distribution:**
   Determine if missing values are randomly distributed or if there's a pattern. Removing them might bias your analysis if there's a systematic reason for missing data.

3. **Impact on Results:**
   Consider the potential impact on your analysis and the goals of your study. Removing missing values may be acceptable if it doesn't compromise the integrity of your findings.

4. **Model Requirements:**
   Check if your machine learning or statistical models require complete datasets or specific preprocessing steps.

**Example Workflow**

Here’s a simplified example of how you might proceed with removing missing values using Python and pandas:

```python
import pandas as pd

# Load your dataset
df = pd.read_csv('your_data.csv')

# Display initial count of missing values
print("Initial count of missing values:\n", df.isnull().sum())

# Remove rows with any missing values
df_clean = df.dropna()

# Display count of missing values after removal
print("Count of missing values after removal:\n", df_clean.isnull().sum())
```

**Conclusion**

By deciding to remove missing values, you've opted for a straightforward approach to handling data gaps. It's essential to weigh the pros and cons carefully based on the specifics of your dataset, your analysis goals, and the requirements of the models you intend to use. 

- **Document Decisions:** Always document your decisions regarding handling missing data to ensure transparency and reproducibility.
- **Consider Sensitivity Analyses:** Conduct sensitivity analyses to assess how robust your findings are to different handling strategies for missing data.

This approach ensures that your dataset is cleaner and suitable for analysis, aligning with your specific analytical goals while maintaining data integrity.

:::
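
The point about missing data not being randomly distributed is worth checking rather than assuming. One simple check, sketched here on invented data, is to compare how often a value is missing across the groups of another column. The column names are borrowed from the shopping data, but the values and the pattern are made up.

```python
import pandas as pd

# Stand-in data; values are made up so that only one group has gaps
demo_df = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
        "Review Rating": [3.1, None, None, 3.5, 2.7, 4.0],
    }
)

# Flag rows where the rating is missing, then average the flag within each group
rating_missing = demo_df["Review Rating"].isna()
missing_rate_by_group = rating_missing.groupby(demo_df["Gender"]).mean()

print(missing_rate_by_group)
```

A large gap between groups hints that the values are not missing at random, in which case dropping rows would remove more of one group than the other.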

Based on what you've read and seen, you should now have decided on how you want to approach the problem, which means it's time to implement your chosen solution.

## 3. Implementing a solution

:::{note} **Use GenAI** to remove missing values in Python
Now that you've decided how to solve the problem, you can use GenAI again to generate some concise code that identifies rows with missing values and removes them.
:::{dropdown} Example Prompt
:::{code-cell}
Generate Python code that will identify missing values and remove any rows with missing values. 
:::

:::{warning} Show AI-Solution
:class: dropdown
```python
import pandas as pd

# Load your dataset (replace 'your_data.csv' with your actual dataset)
df = pd.read_csv('your_data.csv')

# Identify missing values
print("Missing values before removal:\n", df.isnull().sum())

# Remove rows with any missing values
df_clean = df.dropna()

# Identify missing values after removal (should ideally show zeros)
print("Missing values after removal:\n", df_clean.isnull().sum())

# Optional: Display the cleaned DataFrame
print("Cleaned DataFrame:")
print(df_clean.head())
```


In this code snippet:

- Replace `'your_data.csv'` with the path or name of your actual dataset file.
- `df.isnull().sum()` is used to check the number of missing values in each column before removing any rows.
- `df.dropna()` removes all rows containing any missing values.
- `df_clean.isnull().sum()` checks again for missing values in each column after removal to confirm that there are no missing values left.
- The `print(df_clean.head())` line optionally displays the cleaned DataFrame to verify the results.
:::
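
If you plan to reuse this step, one possible variation is to wrap it in a small function that also reports how much data was removed, which makes the result easier to verify and document. This is a sketch under assumptions: `drop_missing_rows` is a hypothetical helper name and the DataFrame is made up.

```python
import pandas as pd


def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without rows that contain missing values, and report how many were removed."""
    cleaned = df.dropna().reset_index(drop=True)
    n_removed = len(df) - len(cleaned)
    share = n_removed / len(df) if len(df) else 0.0
    print(f"Removed {n_removed} of {len(df)} rows ({share:.1%})")
    return cleaned


# Stand-in data; values are made up
demo_df = pd.DataFrame(
    {"Age": [55, 19, None, 21], "Gender": ["Male", "Male", "Male", "Male"]}
)

cleaned_df = drop_missing_rows(demo_df)

# Verify the result instead of trusting it
assert cleaned_df.isna().sum().sum() == 0
```

The original `demo_df` is left untouched, so you can still compare before and after.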

:::{tip} AI-Sandbox
Your turn to play around with the code and develop a solution to handle missing values in a DataFrame!
:::

```python
# You can enter your own AI-generated code here
```




## 4. Troubleshooting

Here is the DataFrame:

If you fill the missing value with the mean, you will get 31.5 as the age, even though the column is meant to hold integers. You have to be careful with these kinds of issues.
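
This mismatch can be reproduced on a tiny example. The ages below are made up and chosen so that the mean of the known values is 31.5; rounding the fill value and casting back to integers is just one possible way to handle it, not the only one.

```python
import pandas as pd

# Made-up ages; the missing entry forces pandas to store the column as floats
ages = pd.Series([25, 38, None], name="Age")
print(ages.dtype)

# The mean skips the missing value: (25 + 38) / 2 = 31.5
mean_age = ages.mean()

# Filling with the mean keeps the column as floats, with a non-integer age in it
filled_float = ages.fillna(mean_age)

# One option: round the fill value first, then cast the column back to integers
# (Python's round() sends halves to the nearest even integer, so 31.5 becomes 32)
filled_int = ages.fillna(round(float(mean_age))).astype("int64")

print(filled_float.tolist())
print(filled_int.tolist())
```

Whether rounding is appropriate depends on the column; keeping it as floats or choosing a different fill value are other options.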

:::{note} **Use GenAI** to troubleshoot data type issues

:::{dropdown} Example Prompt
:::{code-cell}
How do I handle data-type issues when filling in the missing values in a DataFrame?
:::

:::{warning} Show AI-Solution
:class: dropdown

1. **Identify Data-Type Mismatch:**
   Before filling missing values, check the data types of each column using `df.dtypes`. Ensure numeric columns are correctly represented as `int64` or `float64` and not as `object` or other incompatible types.

   ```python
   import pandas as pd
   
   # Example DataFrame
   data = {'A': [1, 2, None, 4, 5],
           'B': ['6', '7', '8', None, '10']}
   
   df = pd.DataFrame(data)
   
   # Display initial data types
   print("Initial data types:")
   print(df.dtypes)
    ```

2. **Convert Data Types if Necessary**

    If numeric columns are stored as strings (`object`), convert them using `pd.to_numeric()` or `astype()` to ensure compatibility for numerical operations and imputation.

```python
# Convert 'B' column from string to numeric (int or float)
df['B'] = pd.to_numeric(df['B'], errors='coerce')

# Display data types after conversion
print("\nData types after conversion:")
print(df.dtypes)
```

3. **Handle Missing Values**

Choose an appropriate strategy (mean, median, mode, interpolation, constant) to fill missing values (`NaN`) in numeric columns after ensuring consistent data types.

```python
# Fill missing values in 'A' column with mean
mean_A = df['A'].mean()
df['A'] = df['A'].fillna(mean_A)

# Fill missing values in 'B' column with median (after conversion)
median_B = df['B'].median()
df['B'] = df['B'].fillna(median_B)

# Display DataFrame after filling missing values
print("\nDataFrame after filling missing values:")
print(df)
```

4. **Validate Data Types After Imputation**

After filling missing values, verify the data types of columns using `df.dtypes` to ensure consistency and compatibility with your analysis requirements.

```python
# Validate data types after filling missing values
print("\nData types after filling missing values:")
print(df.dtypes)
```

By following these steps in your data preprocessing pipeline, you can effectively handle data-type issues when filling missing values in a DataFrame, ensuring data integrity and compatibility for subsequent analysis or modeling tasks.

:::

# DIY Data Wrangling with GenAI

Here are some tasks ranked from simple to advanced that you can tackle with GenAI solutions:

:::{tip} AI-Sandbox
1. **Duplicates Removal**
   - **Problem:** Identify and remove duplicate rows from a dataset.
  
:::{dropdown} Example Prompt
:::{code-cell}
I want to identify and remove duplicate rows from a dataset. 

How do I do that in Python with my dataset?
:::
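
If you want something to compare your AI-generated answer against, here is one way this task could look in pandas. The data is a made-up stand-in, and treating only rows that match in every column as duplicates is an assumption; your dataset may call for comparing just a subset of columns.

```python
import pandas as pd

# Stand-in data; the third row repeats the second
demo_df = pd.DataFrame(
    {
        "Customer ID": [1, 2, 2, 3],
        "Item Purchased": ["Blouse", "Sweater", "Sweater", "Jeans"],
    }
)

# Mark rows that repeat an earlier row in every column
is_duplicate = demo_df.duplicated()
print(f"Duplicate rows found: {is_duplicate.sum()}")

# Keep the first occurrence of each row and renumber the index
deduplicated = demo_df.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```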

```python
# You can enter your own AI-generated code here
```




2. **Creating New Columns**
   - **Problem:** Combine existing columns (e.g., 'LATITUDE' and 'LONGITUDE') into a new column ('COORDINATES').
   - **Prompt:** Help me in creating a new column named 'COORDINATES' by combining the 'LATITUDE' and 'LONGITUDE' columns as strings. Provide a solution using string manipulation functions to concatenate values from multiple columns.

3. **Merging CSVs**
   - **Problem:** Combine multiple CSV files located in a directory ('../Datasets/NOAA_Weather') into a single dataset.
   - **Prompt:** How can I merge multiple CSV files located in a directory into a single dataset? Utilize file handling and data integration techniques to seamlessly combine separate datasets for comprehensive analysis.
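
For comparison with whatever your GenAI tool suggests, tasks 2 and 3 could be sketched as follows. Everything here is illustrative: the coordinate values are made up, `merge_csvs` is a hypothetical helper, the `source_file` column is an addition for traceability, and the demo writes two tiny CSVs to a temporary directory instead of reading the lab's data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Task 2: build a 'COORDINATES' column from 'LATITUDE' and 'LONGITUDE' (made-up values)
stations = pd.DataFrame({"LATITUDE": [54.5, 60.0], "LONGITUDE": [134.4, 30.3]})
stations["COORDINATES"] = (
    stations["LATITUDE"].astype(str) + ", " + stations["LONGITUDE"].astype(str)
)


# Task 3: merge every CSV in a directory into one DataFrame
def merge_csvs(directory: Path) -> pd.DataFrame:
    """Read all *.csv files in `directory` and stack them row-wise."""
    csv_paths = sorted(directory.glob("*.csv"))
    # Record which file each row came from (an extra column, not required by the task)
    frames = [pd.read_csv(path).assign(source_file=path.name) for path in csv_paths]
    return pd.concat(frames, ignore_index=True)


# Demonstrate on a temporary directory with two small, made-up files
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pd.DataFrame({"TAVG": [1.0, 2.0]}).to_csv(tmp_dir / "a.csv", index=False)
    pd.DataFrame({"TAVG": [3.0]}).to_csv(tmp_dir / "b.csv", index=False)
    merged = merge_csvs(tmp_dir)

print(stations)
print(merged)
```

For the lab data you would pass `Path("..") / "Datasets" / "NOAA_Weather"` instead of the temporary directory. Note that `pd.concat` raises an error if the directory contains no CSV files, and files with different columns will produce missing values in the merged result.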