## ***Distribution of Categorical and Continuous Variables: Visualization***

___
### Introduction to the Dataset

The **Adult dataset**, also known as the **Census Income dataset**, was extracted from the 1994 U.S. Census Bureau database. It contains demographic information about individuals living in the United States, and it is typically used to predict whether a person earns more than $50K per year based on attributes such as age, education, and occupation. The records cover people from different regions and backgrounds and reflect socio-economic conditions at the time of data collection.

The dataset includes **15 columns** with both categorical and continuous variables that represent key socio-economic characteristics, such as age, education, occupation, marital status, work hours, and other demographic factors.

### Aim of the Project

The goal of this project is to visualize the distribution of various variables in the Adult dataset. Specifically, we aim to:

- Create **bar charts** to visualize the distribution of categorical variables like gender and marital status. These visualizations will help in understanding how different categories of these variables are represented in the dataset.
- Create **histograms** to show the distribution of continuous variables like age and hours per week. This will allow us to analyze the spread, central tendencies, and ranges of these variables, providing insights into the general demographic and working patterns of the individuals in the dataset.

In this project, we will focus on the following variables:

- **Categorical Variables**: Gender, Marital Status
- **Continuous Variables**: Age, Hours per Week
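
As a rough preview of the two chart types named above, the sketch below draws a bar chart of category counts and a histogram of a numeric column. It assumes `matplotlib` is installed and uses a tiny made-up frame (`sample`) in place of the real data; the bin count of 5 is an arbitrary choice for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tiny made-up sample so the sketch runs on its own; not taken from the real data.
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", "Female"],
    "age": [39, 50, 38, 53, 28],
})

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: one bar per category, height = number of rows in that category.
sex_counts = sample["sex"].value_counts()
sex_counts.plot(kind="bar", ax=ax_bar)
ax_bar.set_xlabel("sex")
ax_bar.set_ylabel("count")

# Histogram: values grouped into equal-width bins over the numeric range.
sample["age"].plot(kind="hist", bins=5, ax=ax_hist)
ax_hist.set_xlabel("age")

fig.tight_layout()
```

In a notebook the figure renders after the cell runs; in a plain script, a call to `plt.show()` would be needed to display it.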
___

### 1. Import necessary libraries

In [71]:
import pandas as pd

### 2. Load the dataset

In [72]:
df_adult = pd.read_csv("Adult_data.csv")
print(df_adult.head())

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country salary  
0          2174             0              40  United-States  <=50K  
1             0             0             

### 3. Data Cleaning

#### _Concise summary of the dataset_

In [73]:
df_adult.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


___
### _Observation_
- The dataset contains 32,561 entries and 15 columns.
- 6 columns are numerical (int64).
- 9 columns are categorical (object).
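- Every column reports 32,561 non-null values, so pandas detects no missing entries. This check only catches true nulls (`NaN`); missing values stored as placeholder strings (the Adult dataset commonly uses `?`) would not be flagged here.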
___
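
Because `info()` only counts true nulls, a separate check for placeholder strings can be useful. The sketch below counts `?` entries per text column. It runs on a hypothetical miniature frame (`demo`), not on `df_adult`, and whether `Adult_data.csv` actually contains `?` values is an assumption that would need to be confirmed on the real file.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df_adult; values are illustrative only.
demo = pd.DataFrame({
    "workclass": ["Private", " ?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Tech-support"],
    "age": [25, 38, 41, 52],
})

# Only non-numeric (text) columns can hold a "?" placeholder.
text_cols = demo.select_dtypes(exclude="number").columns

# Strip surrounding whitespace first, then count exact "?" matches per column.
placeholder_counts = demo[text_cols].apply(
    lambda col: col.str.strip().eq("?").sum()
)
print(placeholder_counts)
```

Stripping whitespace before comparing matters when fields are padded with a leading space, which is how the raw UCI files are laid out.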

### Check for and handle duplicated rows

In [74]:
df_adult.duplicated().sum()

24
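
The cell above only counts the duplicated rows. One possible way to handle them, shown here as a minimal sketch on a hypothetical stand-in frame (`demo`) rather than on `df_adult`, is `drop_duplicates`, which keeps the first occurrence of each repeated row.

```python
import pandas as pd

# Hypothetical stand-in for df_adult containing one exact duplicate row.
demo = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "sex": ["Male", "Male", "Male", "Female"],
})

# Number of rows that are identical to an earlier row.
n_dupes = demo.duplicated().sum()

# keep="first" retains the first occurrence of each duplicated row;
# reset_index(drop=True) renumbers the rows after removal.
deduped = demo.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes, len(demo), len(deduped))
```

Whether dropping is appropriate is a judgment call: in census-style data, two different people can legitimately share the same attribute values, so identical rows are not necessarily data-entry errors.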

### _Selecting a Subset of Columns from the Dataset_

In [75]:
# Keep only the relevant columns
df_subset = df_adult[['sex', 'marital-status', 'age', 'hours-per-week']]

# Display the first few rows of the reduced dataset to check
print(df_subset.head())

      sex      marital-status  age  hours-per-week
0    Male       Never-married   39              40
1    Male  Married-civ-spouse   50              13
2    Male            Divorced   38              40
3    Male  Married-civ-spouse   53              40
4  Female  Married-civ-spouse   28              40


___
#### _Reason for Taking a Subset of Columns_

We keep only a subset of columns because the project focuses on visualizing the distributions of two categorical variables (sex and marital status) and two continuous variables (age and hours per week). Working with just these four columns streamlines the analysis and ties it directly to the project objectives.
___
