# Assignment 1: Locating and Exploring Datasets

Welcome to the first assignment of the course 'AI-powered Data Analysis'. In this assignment, you will learn how to locate and open datasets stored as CSV files, understand their structure, and explore their components.

*You are not required to write any code in this assignment; just run the cells and observe the outputs. The one exception is the 'Reflection Exercise' section at the end.*


## Introduction

This assignment covers the first steps of any data analysis: locating and opening datasets saved as CSV files, understanding their structure, and exploring their components.

### Importance of Metadata

When given a dataset, the first step is to look at the metadata. Metadata provides crucial information about the data, such as:
- The structure of the dataset (e.g., column names, data types)
- Descriptions of the data fields
- Information about the source and context of the data

Understanding the metadata is beneficial because:
- It helps in comprehending the dataset's structure and content.
- It provides insights into the data quality and potential preprocessing steps needed.
- It aids in planning the data analysis and visualization strategies effectively.

### Datasets Overview

We will be using three datasets in this assignment:

1. **NOAA Weather Dataset**
    - This dataset contains weather data collected by the National Oceanic and Atmospheric Administration (NOAA).
    - [Link to Metadata](#)

2. **Kaggle Ecommerce Dataset**
    - This dataset comprises e-commerce data from Kaggle, including information about transactions, products, and customers.
    - [Link to Metadata](#)

3. **Yelp Reviews Dataset**
    - This dataset includes reviews from Yelp, with columns such as 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Time', 'Summary', and 'Text'.
    - [Link to Metadata](#)

After understanding the metadata and the structure of our datasets, we will proceed with coding to explore and analyze the data.


### Datasets Information

The three datasets used in this course are stored in the following directory structure:

```
Datasets/
│
├── NOAA_Weather/
│   ├── 31285099999.csv
│   ├── 72484653123.csv
│   └── 99999926563.csv
│
├── Kaggle_Ecommerce/
│   ├── bank_churners.csv
│   └── shopping_behavior.csv
│
└── Yelp_Reviews/
    └── reviews.csv
```

As you can see, a dataset may contain multiple CSV files. For this assignment, we will be using the `bank_churners.csv` file in the `Kaggle_Ecommerce` folder.
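To see which CSV files each dataset folder contains without opening the folders one by one, you can walk the directory tree. The following is a minimal sketch, not part of the course code: the helper name `list_csv_files` is hypothetical, and the block builds a temporary copy of the layout shown above (with empty files) so that it runs on any machine.

```python
import os
import tempfile


def list_csv_files(root):
    """Return sorted paths, relative to root, of every .csv file under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.csv'):
                full_path = os.path.join(dirpath, name)
                found.append(os.path.relpath(full_path, root))
    return sorted(found)


# Recreate the folder layout in a throwaway directory (empty placeholder files).
layout = {
    'NOAA_Weather': ['31285099999.csv', '72484653123.csv', '99999926563.csv'],
    'Kaggle_Ecommerce': ['bank_churners.csv', 'shopping_behavior.csv'],
    'Yelp_Reviews': ['reviews.csv'],
}
with tempfile.TemporaryDirectory() as tmp:
    for folder, names in layout.items():
        os.makedirs(os.path.join(tmp, folder))
        for name in names:
            open(os.path.join(tmp, folder, name), 'w').close()
    csv_files = list_csv_files(tmp)

for path in csv_files:
    print(path)
```

In the course environment you would pass the real `Datasets` folder (for example `'../Datasets'`) instead of the temporary directory.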


## Introduction to Dataset Overview and Metadata

Before diving into data analysis, it is crucial to get an overview of the dataset and understand its metadata. This initial step provides valuable insights into the data's structure, quality, and the types of information it contains.

### Why Summarize the Dataset Overview?

- **Quick Insights:** A summary gives a quick glance at the main features of the dataset.
- **Data Quality:** Helps in identifying any immediate data quality issues.
- **Preparation for Analysis:** Prepares the ground for more detailed data analysis and visualization.

<div style="background-color: #ADD8E6; padding: 10px;">

🤖 
<br>
**Before starting, ask a Generative AI platform of your choice what kinds of information you can gather about a dataset.**

This is one of the many 'Ask Gen-AI' prompts included throughout the assignments. When you see a cell with a blue background, take a moment to pause and try out the prompt and its variations. This practice will enhance your ability to ask precise questions to AI, ultimately making your work more efficient and streamlined.

</div>

## 1. Locating and Opening CSV Files

First, let's locate and open the CSV files. We'll use the Kaggle Ecommerce dataset for this exercise.

### List all files in the `Kaggle_Ecommerce` folder
We will use the `os` module to list all files in the folder.

1. **Import the `os` module**

```python
import os
```

The `os` module provides functions to interact with the operating system, such as listing files in a directory.

2. **List all files in the `Kaggle_Ecommerce` folder**

```python
files = os.listdir('../Datasets/Kaggle_Ecommerce')
files
```

- `os.listdir('../Datasets/Kaggle_Ecommerce')` lists all files and directories in the `Kaggle_Ecommerce` folder.
- The result is stored in the variable `files`.
- `files` is then displayed to show the list of files.

In [1]:
```python
import os

# List all files in the folder
files = os.listdir('../Datasets/Kaggle_Ecommerce')
files
```

```
['bank_churners.csv', 'shopping_behavior.csv']
```

### Explanation of the above code cell
The above code cell imports the `os` module and lists all files in the `Kaggle_Ecommerce` folder. The `os.listdir()` function retrieves the names of all files and directories in the specified folder. The result is stored in the variable `files`, which is then displayed. This helps in identifying the available CSV files for further analysis.
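Because `os.listdir()` returns names in arbitrary order and also includes subfolders and non-CSV files, it can help to filter and sort the result before picking a file by position. The sketch below is one way to do that under stated assumptions: the helper `list_csvs` is a hypothetical name, and the extra entries `notes.txt` and `archive` are invented only to show the filtering.

```python
import os
import tempfile


def list_csvs(folder):
    """Return the names of the CSV files in folder, sorted alphabetically."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith('.csv')
        and os.path.isfile(os.path.join(folder, name))
    )


# Throwaway folder with two CSV files plus an unrelated file and a subfolder.
with tempfile.TemporaryDirectory() as folder:
    for name in ['shopping_behavior.csv', 'bank_churners.csv', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()
    os.mkdir(os.path.join(folder, 'archive'))
    csv_names = list_csvs(folder)

print(csv_names)  # ['bank_churners.csv', 'shopping_behavior.csv']
```

With a sorted list, `csv_names[0]` refers to the same file on every machine.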

### Read and display the first few rows of a CSV file
We will use the `pandas` library to read the CSV file and display its first few rows.

1. **Import the `pandas` library**

```python
import pandas as pd
```

The `pandas` library is used for data manipulation and analysis.

2. **Build the path to the first CSV file**

```python
file_path = os.path.join('../Datasets/Kaggle_Ecommerce', files[0])
```

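- `os.path.join('../Datasets/Kaggle_Ecommerce', files[0])` creates the full path to the first file in the `files` list, which is `bank_churners.csv` in the listing above. Note that `os.listdir()` returns names in arbitrary order, so the first entry may differ on another machine.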
- The result is stored in the variable `file_path`.

3. **Load the CSV file into a DataFrame**

```python
df = pd.read_csv(file_path)
```

- `pd.read_csv(file_path)` reads the CSV file and loads its contents into a `pandas` DataFrame.
- The DataFrame is stored in the variable `df`.

4. **Display the first few rows of the DataFrame**

```python
df.head()
```

- `df.head()` displays the first five rows of the DataFrame.

In [2]:
```python
import pandas as pd

# Read the first file in the list (bank_churners.csv in the listing above)
file_path = os.path.join('../Datasets/Kaggle_Ecommerce', files[0])
df = pd.read_csv(file_path)
df.head()
```

| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | \$60K - \$80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than \$40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | \$80K - \$120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.0 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than \$40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.76 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | \$60K - \$80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.5 | 0.0 |


The first five rows of the DataFrame are displayed using `df.head()`. This provides a quick overview of the data structure and the initial few records.
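`head()` also accepts the number of rows to return, and `tail()` does the same from the end of the DataFrame. The sketch below illustrates this on a small stand-in DataFrame named `sample`, built from three of the columns in the rows displayed above so that it runs without the course files.

```python
import io

import pandas as pd

# Stand-in data: three columns copied from the five rows displayed above.
csv_text = """CLIENTNUM,Customer_Age,Gender
768805383,45,M
818770008,49,F
713982108,51,M
769911858,40,F
709106358,40,M
"""
sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)  # first 2 rows instead of the default 5
last_two = sample.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```

On the real data, `df.head(10)` or `df.tail()` would work the same way.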


### Get the number of rows in the DataFrame

We will use the `len()` function to get the number of rows in the DataFrame.

1. **Get the number of rows**

```python
num_rows = len(df)
num_rows
```

In [3]:
```python
num_rows = len(df)
num_rows
```

```
10127
```

This number represents the total number of records in the dataset. This information is crucial for understanding the dataset's size.
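If you also want the number of columns, the `.shape` attribute returns both counts at once as a `(rows, columns)` tuple. A minimal sketch, using a small stand-in DataFrame named `sample` rather than the course data:

```python
import pandas as pd

# Stand-in DataFrame with 3 rows and 2 columns (values taken from the rows shown earlier).
sample = pd.DataFrame({
    'Customer_Age': [45, 49, 51],
    'Gender': ['M', 'F', 'M'],
})

n_rows, n_cols = sample.shape  # (number of rows, number of columns)
print(n_rows, n_cols)          # 3 2
print(len(sample) == n_rows)   # True: len() counts rows
```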

## 2. Understanding the Structure

Let's get more information about the dataset to understand its structure better.

### Get the column names in the DataFrame

We will use the `.columns` attribute of the DataFrame to get the column names.

1. **Get the column names**

```python
columns = df.columns
columns
```

In [4]:
```python
columns = df.columns
columns
```

```
Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')
```

The names inside the square brackets `[]` are the columns of the DataFrame, so you can see all of them at a glance. This is especially helpful when dealing with large datasets or when you want to work with the columns programmatically.
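As a small illustration of working with columns programmatically, the sketch below converts the column index to a plain Python list, checks whether a column exists, and selects one column by name. It uses a stand-in DataFrame named `sample` with three of the columns listed above; the variable names are only illustrative.

```python
import pandas as pd

# Stand-in DataFrame with three of the columns listed above.
sample = pd.DataFrame({
    'CLIENTNUM': [768805383, 818770008],
    'Customer_Age': [45, 49],
    'Gender': ['M', 'F'],
})

column_names = sample.columns.tolist()       # plain Python list of names
has_age = 'Customer_Age' in sample.columns   # membership test on the column index
ages = sample['Customer_Age']                # select a single column by name

print(column_names)
print(has_age)
print(ages.tolist())
```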

### Get basic information about the DataFrame
We will use the `info()` method to get a concise summary of the DataFrame, including the number of non-null values and the data type of each column.

1. **Get a concise summary of the DataFrame**

```python
df.info()
```

- `df.info()` displays a concise summary of the DataFrame.
- It shows the number of non-null values, data types of each column, and memory usage.

In [5]:
```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           10127 non-null  object
 6   Marital_Status            10127 non-null  object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_
```

**A detailed breakdown of the above output is:**


**Class Type:**

- The first line `<class 'pandas.core.frame.DataFrame'>` tells you that the object is a DataFrame.

**Index Range:**

- `RangeIndex: 10127 entries, 0 to 10126` indicates that the DataFrame has an Index with 10127 entries ranging from 0 to 10126.

**Column Information:**

- `Data columns (total 21 columns):` indicates that there are 21 columns in the DataFrame.

**Column Details:**

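- For each column, you get the column number (starting from 0), the column name, the count of non-null values, and the data type.
- The non-null count is the number of entries in the column that are not missing. Every column shown has 10127 non-null values, the same as the number of rows, so none of them has missing values.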

**Data Types Summary**

- `dtypes: float64(5), int64(10), object(6)` summarizes the data types present in the DataFrame and how many columns have each type. In pandas, `object` columns usually hold strings.

**Memory** 
- `memory usage: ... MB` indicates the memory usage of the DataFrame.
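The individual pieces of the `info()` summary can also be retrieved one at a time, which is convenient when you want to use them in later code. The following is a sketch on a small stand-in DataFrame named `sample`; the missing age in the third row is invented purely to show how a missing value affects the counts.

```python
import pandas as pd

# Stand-in DataFrame; the None in Customer_Age is an invented missing value.
sample = pd.DataFrame({
    'Customer_Age': [45, 49, None],
    'Gender': ['M', 'F', 'M'],
    'Credit_Limit': [12691.0, 8256.0, 3418.0],
})

column_types = sample.dtypes                        # data type of each column
non_null_per_column = sample.notna().sum()          # same idea as "Non-Null Count"
missing_per_column = sample.isna().sum()            # number of missing entries
total_bytes = sample.memory_usage(deep=True).sum()  # memory used, in bytes

print(column_types)
print(non_null_per_column)
print(missing_per_column)
print(total_bytes)
```

Here `Customer_Age` has a non-null count of 2 even though the DataFrame has 3 rows, which is how a missing value shows up in the `info()` output.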

<div style="background-color: #ADD8E6; padding: 10px;">

🤖
## 3. Try It Yourself: Exploring Other Datasets

Now that you have learned how to locate, open, and explore a CSV file, it's time for you to practice these steps with other datasets.
Follow the instructions below and use Generative AI for any help you might need.

### Instructions:

1. **Choose a Dataset**:
   - Navigate to the `Datasets` folder and choose a different dataset. For example, you can use the `NOAA_Weather` or `Yelp_Reviews` folder.


2. **List All Files in the Chosen Folder**:
   - Use the `os` module to list all files in the chosen folder. Here’s a template to get you started:
   ```python
   import os

   # List all files in the chosen folder
   folder_path = '../Datasets/NOAA_Weather'  # Change this to your chosen folder
   files = os.listdir(folder_path)
   print(files)
   ```


3. **Get your hands dirty**:
   - You can use the commands shown above, or try something new that Generative AI suggested.
</div>

## 4. Reflection Exercise

In this section, you will reflect on your learning experience and answer the following questions. Please provide your answers when prompted by the code.

### Questions:

1. **What would you describe as your field? Use a 1-2 word description.**
2. **What is something that would be interesting for you to measure or track in your discipline?**

Run the following two cells and input your answers in the text boxes that show up. Don't forget to press `Return` once you are done typing. In case you made a mistake or want to re-enter your answer, just run the corresponding cell again.

In [6]:
```python
field = input("What would you describe as your field? Use a 1-2 word description.")
```

```
What would you describe as your field? Use a 1-2 word description. I am a primary school teacher.
```

In [7]:
```python
measure = input("What is something that would be interesting for you to measure or track in your discipline?")
```

```
What is something that would be interesting for you to measure or track in your discipline? Student engagement during different times of the day.
```


Running the following cell is really important, as this will save the answers you gave above. In case you change your answers to any of the above questions, please be sure to run the following cell again, to save the updated answers.

You needn't be too concerned with what this code is doing. In short, it uses string operations to insert your answers into a prompt template and then saves the result to a `.txt` file.

<div style="background-color: #ADD8E6; padding: 10px;">

🤖
<br>
You can give this code to Generative AI and ask what each line is doing!

</div>

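In [8]:
```python
# Read the given prompt template
with open('../Prompts/Assignment_1_given.txt', 'r') as file:
    prompt = file.read()

# Fill the first '***' placeholder with the field, then the next one with the measure
prompt = prompt.replace('***', field, 1)
prompt = prompt.replace('***', measure, 1)

# Save the completed prompt
with open('../Prompts/Assignment_1.txt', 'w') as file:
    file.write(prompt)
```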
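To see how the two `replace` calls fill the placeholders one after the other, here is a minimal sketch on an in-memory string. The template text and the two answers are hypothetical; the actual contents of `Assignment_1_given.txt` are not shown in this assignment.

```python
# Hypothetical template with two '***' placeholders (not the real file contents).
template = "I work in ***. I would like to measure ***."

field = "education"              # hypothetical answer to question 1
measure = "student engagement"   # hypothetical answer to question 2

# The third argument (1) limits each call to the first remaining '***'.
prompt = template.replace('***', field, 1)
prompt = prompt.replace('***', measure, 1)

print(prompt)  # I work in education. I would like to measure student engagement.
```

Without the count argument, the first call would replace both placeholders with the field.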

## Summary

In this assignment, you have learned how to:
- Locate and open CSV files
- Understand the structure of a dataset
- Explore the components of each dataset, such as column names and their respective data types

This foundational knowledge is crucial for effective data analysis. Make sure you understand each step and feel free to explore further by modifying the cells and running them again.