# Day 1: Data Loading and Basic Exploration

## Task: 

- Load a dataset into a Pandas DataFrame and perform basic exploration.

## Description: 

- Use Pandas' read_csv() function. 
- Display the first few rows of the dataset using the head() method to understand its structure and format. 
- Check the data types of each column using the info() method to identify any inconsistencies or missing values. 
- Calculate summary statistics for numerical columns using the describe() method to gain insights into the data distribution. 
- Identify missing values in the dataset using methods like isna().

### Package Imports

In [15]:
# Data analysis packages
import pandas as pd
import numpy as np

### Loading the Dataset

In [16]:
# Read the dataset via Pandas
df = pd.read_csv('Mental Health Dataset.csv')

### EDA

In [17]:
print("First few rows of the dataset:")
df.head()

First few rows of the dataset:


Unnamed: 0,Timestamp,Gender,Country,Occupation,self_employed,family_history,treatment,Days_Indoors,Growing_Stress,Changes_Habits,Mental_Health_History,Mood_Swings,Coping_Struggles,Work_Interest,Social_Weakness,mental_health_interview,care_options
0,2014-08-27 11:29:31,Female,United States,Corporate,,No,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Not sure
1,2014-08-27 11:31:50,Female,United States,Corporate,,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,No
2,2014-08-27 11:32:39,Female,United States,Corporate,,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Yes
3,2014-08-27 11:37:59,Female,United States,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,Maybe,Yes
4,2014-08-27 11:43:36,Female,United States,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Yes


### Understanding the Data Types

In [18]:
# Obtain dataset information
print("\nDataset Information:")
df.info()


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292364 entries, 0 to 292363
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   Timestamp                292364 non-null  object
 1   Gender                   292364 non-null  object
 2   Country                  292364 non-null  object
 3   Occupation               292364 non-null  object
 4   self_employed            287162 non-null  object
 5   family_history           292364 non-null  object
 6   treatment                292364 non-null  object
 7   Days_Indoors             292364 non-null  object
 8   Growing_Stress           292364 non-null  object
 9   Changes_Habits           292364 non-null  object
 10  Mental_Health_History    292364 non-null  object
 11  Mood_Swings              292364 non-null  object
 12  Coping_Struggles         292364 non-null  object
 13  Work_Interest            292364 non-null  object
 14

### What Does This Tell Me?

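- There are 17 columns and 292,364 rows.
- Every column has the object dtype, so all values, including the timestamps, are currently stored as generic Python objects (here, text).
- The "Non-Null Count" is the number of rows in each column that hold a non-missing value.
- 'self_employed' has only 287,162 non-null entries, so it contains NaN values; this requires additional analysis.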
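
Since every column is read in as an object, one possible follow-up is to convert columns to more specific dtypes. The sketch below is only an illustration under assumptions: `sample` is a toy stand-in built from a few of the rows shown above (not the full dataset), and the choice of which columns to convert is hypothetical.

```python
import pandas as pd

# Toy stand-in for df, using a few values from the head() output above.
# This is NOT the full dataset; it only illustrates the conversions.
sample = pd.DataFrame({
    "Timestamp": ["2014-08-27 11:29:31", "2014-08-27 11:31:50", "2014-08-27 11:37:59"],
    "Gender": ["Female", "Female", "Female"],
    "self_employed": [None, None, "No"],
})

# Parse the timestamp strings into datetime values
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])

# Store low-cardinality text columns as categoricals (missing values stay NaN)
for col in ["Gender", "self_employed"]:
    sample[col] = sample[col].astype("category")

print(sample.dtypes)
```

With a datetime column, calls such as `sample["Timestamp"].min()` and `.max()` return the earliest and latest timestamp, which the object dtype does not support meaningfully.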

### Summary Statistics

In [19]:
# .describe() to obtain the summary statistics
print("\nSummary Statistics:")
df.describe()


Summary Statistics:


Unnamed: 0,Timestamp,Gender,Country,Occupation,self_employed,family_history,treatment,Days_Indoors,Growing_Stress,Changes_Habits,Mental_Health_History,Mood_Swings,Coping_Struggles,Work_Interest,Social_Weakness,mental_health_interview,care_options
count,292364,292364,292364,292364,287162,292364,292364,292364,292364,292364,292364,292364,292364,292364,292364,292364,292364
unique,734,2,35,5,2,2,2,5,3,3,3,3,2,3,3,3,3
top,2014-08-27 12:31:41,Male,United States,Housewife,No,No,Yes,1-14 days,Maybe,Yes,No,Medium,No,No,Maybe,No,No
freq,780,239850,171308,66351,257994,176832,147606,63548,99985,109523,104018,101064,154328,105843,103393,232166,118886


### What Does This Tell Me?

- The "Timestamp" column contains 292,364 non-null values, all stored as objects (strings, not datetime values). It has 734 unique values; the most frequent, 2014-08-27 12:31:41, appears 780 times. describe() does not report the earliest or latest timestamp, so the time range cannot be read from this output.

- The "Gender" column contains 292,364 non-null values, which are all objects. There are only 2 unique values in this column, "Female" and "Male", with 239,850 rows having "Male" and the remaining 52,514 rows having "Female".

- The "Country" column contains 292,364 non-null values, which are all objects. There are 35 unique values in this column, with "United States" being the most frequent value, appearing 171,308 times.

- The "Occupation" column contains 292,364 non-null values, which are all objects. There are 5 unique values in this column, with "Housewife" being the most frequent value, appearing 66,351 times.

- The "self_employed" column contains 287,162 non-null values, which are all objects, so 5,202 rows are missing. There are 2 unique values in this column, "No" and "Yes", with 257,994 rows having "No" and the remaining 29,168 non-null rows having "Yes".

- The "family_history" column contains 292,364 non-null values, which are all objects. There are 2 unique values in this column, "No" and "Yes", with 176,832 rows having "No" and 115,532 rows having "Yes".

- The "treatment" column contains 292,364 non-null values, which are all objects. There are 2 unique values in this column, "No" and "Yes", with 147,606 rows having "Yes" and 144,758 rows having "No".

- The "Days_Indoors" column contains 292,364 non-null values, which are all objects. There are 5 unique values in this column, with "1-14 days" being the most frequent value, appearing 63,548 times. describe() does not list the other four categories or their counts.

- The "Growing_Stress" column contains 292,364 non-null values, which are all objects. There are 3 unique values in this column, "Yes", "No", and "Maybe", with "Maybe" being the most frequent value, appearing 99,985 times.

- Changes_Habits: This column contains 292,364 non-null values, with 3 unique values. describe() does not count NaN as a unique value, and the column has no missing values, so all three are actual categories. The most common value is "Yes", which appears 109,523 times.

- Mental_Health_History: This column contains 292,364 non-null values, with 3 unique values. The most common value is "No", which appears 104,018 times.

- Mood_Swings: This column contains 292,364 non-null values, with 3 unique values. The most common value is "Medium", which appears 101,064 times.

- Coping_Struggles: This column contains 292,364 non-null values, with 2 unique values (No and Yes). The most common value is "No", which appears 154,328 times.

- Work_Interest: This column contains 292,364 non-null values, with 3 unique values. The most common value is "No", which appears 105,843 times.

- Social_Weakness: This column contains 292,364 non-null values, with 3 unique values. The most common value is "Maybe", which appears 103,393 times.

- mental_health_interview: This column contains 292,364 non-null values, with 3 unique values. The most common value is "No", which appears 232,166 times.

- care_options: This column contains 292,364 non-null values, with 3 unique values ("Not sure", "No", and "Yes" all appear in the first five rows). The most common value is "No", which appears 118,886 times.
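
For object columns, describe() reports only the most frequent value and its count, so the counts of the other categories have to be derived or looked up separately. A minimal sketch of one way to get the full breakdown with value_counts() follows; `sample` is a hypothetical toy frame with made-up values, not the real dataset.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not the real dataset)
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male"],
    "self_employed": ["No", None, "Yes", "No"],
})

# Full frequency table for one column; dropna=False also counts missing values
gender_counts = sample["Gender"].value_counts(dropna=False)
print(gender_counts)

# Shares instead of raw counts, with NaN kept as its own row
self_employed_share = sample["self_employed"].value_counts(normalize=True, dropna=False)
print(self_employed_share)
```

On the real DataFrame, the same calls on `df["Days_Indoors"]` or any other column would list every category with its count, rather than only the top one.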

### Missing Value Analysis

In [20]:
# Identify NaN values in table format
df.isna()

Unnamed: 0,Timestamp,Gender,Country,Occupation,self_employed,family_history,treatment,Days_Indoors,Growing_Stress,Changes_Habits,Mental_Health_History,Mood_Swings,Coping_Struggles,Work_Interest,Social_Weakness,mental_health_interview,care_options
0,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292359,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
292360,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
292361,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
292362,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


### What Does This Tell Me?

- A cell-by-cell True/False table of this size is not very helpful for spotting NaN values.
- It would be easier to count the NaN values per column.

In [21]:
# Count NaN values in each column
nan_values_count = df.isna().sum()

print("\nCount of NaN values in each column:")

# Print the result
print(nan_values_count)


Count of NaN values in each column:
Timestamp                     0
Gender                        0
Country                       0
Occupation                    0
self_employed              5202
family_history                0
treatment                     0
Days_Indoors                  0
Growing_Stress                0
Changes_Habits                0
Mental_Health_History         0
Mood_Swings                   0
Coping_Struggles              0
Work_Interest                 0
Social_Weakness               0
mental_health_interview       0
care_options                  0
dtype: int64


### What Does This Tell Me?

- Only one column ('self_employed') contains NaN values.
- It has 5,202 NaN values, about 1.8% of the 292,364 rows; these are the only missing values in the dataset.
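
The share of missing values can also be computed directly instead of by hand. The sketch below is an illustration on a toy frame (`sample` and its values are hypothetical); it relies on the fact that the mean of a Boolean isna() mask is the fraction of missing entries.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not the real dataset)
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Female", "Female"],
    "self_employed": [None, None, None, "No", "No"],
})

# Count and percentage of missing values per column
missing_count = sample.isna().sum()
missing_pct = sample.isna().mean() * 100

missing_summary = pd.DataFrame({
    "missing": missing_count,
    "percent": missing_pct.round(2),
})

# Keep only the columns that actually have missing values
missing_summary = missing_summary[missing_summary["missing"] > 0]
print(missing_summary)
```

Applied to the real DataFrame, this would be expected to show a single row for 'self_employed', matching the counts above.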