<a href="https://colab.research.google.com/github/manas-shukla-101/Explortory-Data-Analysis/blob/main/BBookEDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import files
uploaded = files.upload()  # Upload the dataset file into the Colab session

## Initial Data Inspection


In [None]:
import pandas as pd

df = pd.read_csv('BBook - Form Responses 1.csv')

print("First 5 rows of the DataFrame:")
print(df.head())
print("\nDataFrame Info:")
df.info()

print("\nDescriptive Statistics:")
print(df.describe())

## Data Cleaning

Identify and handle missing values, duplicate entries, and potential inconsistencies in the dataset.

In [None]:
new_column_names = {
    'Age:': 'Age',
    'Please specify your residential area: ': 'ResidentialArea',
    '1. What is your primary mode of daily commute?': 'PrimaryCommuteMode',
    '2. What is your typical one-way commute time?': 'CommuteTime',
    '3. How many days per week do you commute?': 'DaysPerWeekCommute',
    '4. How often do you encounter significant traffic congestion?': 'TrafficCongestionFrequency',
    '5. How much extra time does congestion add to your daily commute?': 'ExtraTimeCongestion',
    '6. What is the main cause of congestion in your area?': 'MainCauseCongestion',
    '7. Which solution would most effectively reduce congestion?': 'EffectiveSolutionCongestion',
    '8. Would you switch to public transport if it were more reliable?': 'SwitchToPublicTransport'
}
df = df.rename(columns=new_column_names)

df.columns = df.columns.str.strip().str.replace('[^A-Za-z0-9_]+', '', regex=True)

print("DataFrame columns after final renaming and stripping:")
print(df.columns.tolist())

In [None]:
df = df.rename(columns={'WhatIsYourTypicalOnewayCommuteTime': 'CommuteTime'})

duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows found: {duplicate_rows}")

if duplicate_rows > 0:
    df.drop_duplicates(inplace=True)
    print(f"Duplicate rows removed. DataFrame shape after removing duplicates: {df.shape}")
else:
    print("No duplicate rows to remove.")

print("\nDataFrame columns after final renaming:")
print(df.columns.tolist())

In [None]:
print("Missing values before handling:")
print(df.isnull().sum())

missing_cols = df.columns[df.isnull().any()].tolist()

for col in missing_cols:
    if df[col].isnull().any():
        mode_value = df[col].mode()[0]
        df[col] = df[col].fillna(mode_value)
        print(f"Filled missing values in column '{col}' with mode: {mode_value}")

print("\nMissing values after handling:")
print(df.isnull().sum())
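
The cells above handle duplicates and missing values. For the "potential inconsistencies" part, a minimal sketch is shown below. It assumes inconsistencies take the form of stray whitespace in text answers, which may or may not apply to this dataset. The helper name `normalize_text_columns` and the toy values are hypothetical.

```python
import pandas as pd

def normalize_text_columns(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with whitespace trimmed and collapsed in the given text columns."""
    cleaned = frame.copy()
    for col in columns:
        cleaned[col] = (
            cleaned[col]
            .str.strip()                           # drop leading/trailing spaces
            .str.replace(r"\s+", " ", regex=True)  # collapse repeated spaces
        )
    return cleaned

# Hypothetical answers that differ only by whitespace
toy = pd.DataFrame({"PrimaryCommuteMode": ["Walking", " Walking", "Walking  ", "Bicycle"]})
toy_clean = normalize_text_columns(toy, ["PrimaryCommuteMode"])

# After normalization the three 'Walking' variants count as one category
print(toy_clean["PrimaryCommuteMode"].value_counts())
```

Running `value_counts()` on each text column before and after such a step is a quick way to see whether any labels were being split across near-duplicate spellings.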

## Data Transformation

Transform the ordinal categorical columns into integer codes for further analysis.


In [None]:
import pandas as pd

age_categories = ['Below 18', '18-30', '30-45', '>45']
df['Age'] = pd.Categorical(df['Age'], categories=age_categories, ordered=True)
df['Age'] = df['Age'].cat.codes

commute_time_categories = ['Less than 30 minutes', '30-60 minutes', '1-2 hours', 'More than 2 hours']
df['CommuteTime'] = pd.Categorical(df['CommuteTime'], categories=commute_time_categories, ordered=True)
df['CommuteTime'] = df['CommuteTime'].cat.codes

days_commute_categories = ['1-3 days', '4-5 days', '6-7 days']
df['DaysPerWeekCommute'] = pd.Categorical(df['DaysPerWeekCommute'], categories=days_commute_categories, ordered=True)
df['DaysPerWeekCommute'] = df['DaysPerWeekCommute'].cat.codes

traffic_freq_categories = ['Rarely/Never', '1-2 times per week', '3-5 times per week', 'Daily']
df['TrafficCongestionFrequency'] = pd.Categorical(df['TrafficCongestionFrequency'], categories=traffic_freq_categories, ordered=True)
df['TrafficCongestionFrequency'] = df['TrafficCongestionFrequency'].cat.codes

extra_time_categories = ['Less than 15 minutes', '15-30 minutes', '30-60 minutes', 'More than 1 hour']
df['ExtraTimeCongestion'] = pd.Categorical(df['ExtraTimeCongestion'], categories=extra_time_categories, ordered=True)
df['ExtraTimeCongestion'] = df['ExtraTimeCongestion'].cat.codes

switch_transport_categories = ['Definitely not', 'Probably not', 'Not sure', 'Probably yes', 'Yes, definitely']
df['SwitchToPublicTransport'] = pd.Categorical(df['SwitchToPublicTransport'], categories=switch_transport_categories, ordered=True)
df['SwitchToPublicTransport'] = df['SwitchToPublicTransport'].cat.codes

print("DataFrame after ordinal encoding:")
print(df.head())

print("\nDataFrame Info after transformations:")
df.info()
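
One caveat with `.cat.codes`: any value that is not in the supplied category list (including a missing value) is encoded as `-1`. The sketch below uses a hypothetical toy series, not the survey data, to show how such rows could be detected before continuing.

```python
import pandas as pd

age_categories = ['Below 18', '18-30', '30-45', '>45']

# Hypothetical responses: the third label is deliberately mistyped, the fourth is missing
toy_age = pd.Series(['18-30', '>45', '18 - 30', None])

codes = pd.Categorical(toy_age, categories=age_categories, ordered=True).codes

# Rows whose code is -1 did not match any declared category
unmatched = toy_age[codes == -1]

print(codes.tolist())      # [1, 3, -1, -1]
print(unmatched.tolist())  # ['18 - 30', None]
```

Applied to the real columns, a non-empty `unmatched` result would point to a label in the CSV that differs from the category lists defined above.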

## Exploratory Data Analysis (EDA) & Hypothesis Generation

Conduct in-depth exploratory data analysis using various visualizations to identify key patterns and relationships in the dataset.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported and default figure size set.")

In [None]:
ordinal_features = [
    'Age',
    'CommuteTime',
    'DaysPerWeekCommute',
    'TrafficCongestionFrequency',
    'ExtraTimeCongestion',
    'SwitchToPublicTransport'
]

plt.figure(figsize=(15, 10))
for i, col in enumerate(ordinal_features):
    plt.subplot(2, 3, i + 1)
    sns.countplot(data=df, x=col, hue=col, palette='viridis', legend=False)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("Count plots for ordinal encoded features displayed.")

In [None]:
nominal_features = [
    'ResidentialArea',
    'PrimaryCommuteMode',
    'MainCauseCongestion',
    'EffectiveSolutionCongestion'
]

plt.figure(figsize=(20, 15))
for i, col in enumerate(nominal_features):
    plt.subplot(2, 2, i + 1)
    sns.countplot(data=df, y=col, order=df[col].value_counts().index, hue=col, palette='viridis', legend=False)
    plt.title(f'Distribution of {col}')
    plt.xlabel('Count')
    plt.ylabel(col)
plt.tight_layout()
plt.show()

print("Count plots for nominal categorical features displayed.")

## Visualize 'PrimaryCommuteMode' vs. 'TrafficCongestionFrequency'

Visualize the relationship between 'PrimaryCommuteMode' and 'TrafficCongestionFrequency' using a grouped bar chart to understand how different commute modes correlate with congestion frequency.


In [None]:
traffic_freq_categories = ['Rarely/Never', '1-2 times per week', '3-5 times per week', 'Daily']
# Map codes back to labels; a code of -1 (unmatched or missing) becomes NaN
# instead of silently indexing the last label.
df['TrafficCongestionFrequency_labels'] = df['TrafficCongestionFrequency'].map(dict(enumerate(traffic_freq_categories)))

plt.figure(figsize=(12, 7))
sns.countplot(
    data=df,
    x='PrimaryCommuteMode',
    hue='TrafficCongestionFrequency_labels',
    hue_order=traffic_freq_categories,  # keep the legend in ordinal order
    palette='viridis',
    order=df['PrimaryCommuteMode'].value_counts().index
)
plt.title('Primary Commute Mode vs. Traffic Congestion Frequency')
plt.xlabel('Primary Commute Mode')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Traffic Congestion Frequency')
plt.tight_layout()
plt.show()

df.drop(columns=['TrafficCongestionFrequency_labels'], inplace=True)

print("Grouped bar chart for 'PrimaryCommuteMode' vs. 'TrafficCongestionFrequency' displayed.")

### Formulated Hypotheses

Based on the exploratory analysis above, in particular the distributions of the ordinal and nominal features and the relationship between 'PrimaryCommuteMode' and 'TrafficCongestionFrequency', the following hypotheses are formulated:

1.  **Hypothesis 1 (Commute Mode and Congestion Frequency)**: Respondents who primarily use public transport experience lower frequencies of significant traffic congestion compared to those using personal vehicles or auto-rickshaws/taxis.
    *   **Reasoning**: The grouped bar chart for 'PrimaryCommuteMode' vs. 'TrafficCongestionFrequency' shows that while 'Public Transport (Bus/Local/Metro)' users still experience congestion, the proportion reporting 'Daily' congestion might be lower than for 'Personal Vehicle (Car/Bike/Scooter)' users or 'Auto-rickshaw/Taxi' users, who seem to have a higher share in the 'Daily' and '3-5 times per week' categories.

2.  **Hypothesis 2 (Congestion Cause and Extra Commute Time)**: 'Poor road infrastructure' and 'Too many private vehicles' are perceived as the main causes of congestion, and these causes are associated with higher reported 'ExtraTimeCongestion' (e.g., '30-60 minutes' or 'More than 1 hour').
    *   **Reasoning**: The 'MainCauseCongestion' distribution highlights these as prominent causes. Further analysis would involve correlating these specific causes with the 'ExtraTimeCongestion' categories to see if they indeed lead to longer delays.

3.  **Hypothesis 3 (Age Group and Switch to Public Transport)**: Younger age groups (e.g., '18-30' and '30-45') are more likely to express a willingness to switch to public transport if it were more reliable compared to older age groups ('>45').
    *   **Reasoning**: The 'Age' distribution shows a high concentration in the '18-30' and '30-45' categories. The 'SwitchToPublicTransport' distribution indicates a strong inclination towards switching ('Yes, definitely', 'Probably yes'). Investigating the cross-tabulation of these two variables might reveal a stronger positive sentiment towards switching among younger respondents.

## Test Hypothesis 1 (Commute Mode and Congestion Frequency)

Test the hypothesis that respondents who primarily use public transport experience lower frequencies of significant traffic congestion compared to those using personal vehicles or auto-rickshaws/taxis, using a Chi-square test and a proportional stacked bar chart.


In [None]:
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['PrimaryCommuteMode'], df['TrafficCongestionFrequency'])
print("Contingency Table (PrimaryCommuteMode vs. TrafficCongestionFrequency):\n", contingency_table)

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-square Statistic: {chi2:.2f}")
print(f"P-value: {p_value:.3f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:\n", pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

traffic_freq_categories = ['Rarely/Never', '1-2 times per week', '3-5 times per week', 'Daily']
# Map codes back to labels; -1 (unmatched or missing) becomes NaN rather than wrapping to 'Daily'.
df['TrafficCongestionFrequency_labels'] = df['TrafficCongestionFrequency'].map(dict(enumerate(traffic_freq_categories)))

# Row-normalized crosstab: each commute mode's proportions sum to 1.
contingency_table_proportional = pd.crosstab(df['PrimaryCommuteMode'], df['TrafficCongestionFrequency_labels'], normalize='index')

# Put the columns in ordinal order; reindex tolerates a category with no responses.
contingency_table_proportional = contingency_table_proportional.reindex(columns=traffic_freq_categories, fill_value=0)

contingency_table_proportional.plot(kind='bar', stacked=True, figsize=(12, 7), cmap='viridis')
plt.title('Proportional Distribution of Traffic Congestion Frequency by Primary Commute Mode')
plt.xlabel('Primary Commute Mode')
plt.ylabel('Proportion')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Traffic Congestion Frequency', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

df.drop(columns=['TrafficCongestionFrequency_labels'], inplace=True)

print("Chi-square test performed and proportional stacked bar chart displayed.")
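
The Chi-square p-value relies on a large-sample approximation, and a common rule of thumb is to be cautious when many expected counts fall below 5. The p-value also says nothing about how strong the association is. The sketch below is a hypothetical helper (`chi_square_summary` is not part of the original notebook) that reports both the share of small expected counts and Cramér's V as an effect size, demonstrated on a made-up 2x2 table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_summary(table: pd.DataFrame) -> dict:
    """Run a Chi-square test and add two diagnostics to the usual output."""
    chi2, p_value, dof, expected = chi2_contingency(table)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    # Cramér's V: 0 means no association, 1 means perfect association
    cramers_v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else float("nan")
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        "share_expected_below_5": float((expected < 5).mean()),
        "cramers_v": float(cramers_v),
    }

# Made-up counts for illustration only
toy_table = pd.DataFrame(
    [[20, 10], [10, 20]],
    index=["mode_a", "mode_b"],
    columns=["rarely", "daily"],
)
summary = chi_square_summary(toy_table)
print(summary)
```

In the notebook, the same helper could be called on `contingency_table`; a large `share_expected_below_5` would suggest merging sparse categories before interpreting the p-value.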

## Test Hypothesis 2 (Congestion Cause and Extra Commute Time)

Test the hypothesis that 'Poor road infrastructure' and 'Too many private vehicles' as main causes are associated with higher reported 'ExtraTimeCongestion'. This will involve performing Chi-square tests and generating proportional stacked bar charts for each specific cause against 'ExtraTimeCongestion'.


In [None]:
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns

extra_time_categories = ['Less than 15 minutes', '15-30 minutes', '30-60 minutes', 'More than 1 hour']

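# Map codes back to labels; -1 (unmatched or missing) becomes NaN rather than wrapping to the last label.
df['ExtraTimeCongestion_labels'] = df['ExtraTimeCongestion'].map(dict(enumerate(extra_time_categories)))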

df['IsPoorRoadInfrastructure'] = df['MainCauseCongestion'].str.contains('Poor road infrastructure', case=False, na=False)

contingency_table_infra = pd.crosstab(df['IsPoorRoadInfrastructure'], df['ExtraTimeCongestion'])
print("\nContingency Table (IsPoorRoadInfrastructure vs. ExtraTimeCongestion):\n", contingency_table_infra)

chi2_infra, p_value_infra, dof_infra, expected_infra = chi2_contingency(contingency_table_infra)

print(f"\nChi-square Statistic (Poor Road Infrastructure): {chi2_infra:.2f}")
print(f"P-value (Poor Road Infrastructure): {p_value_infra:.3f}")
print(f"Degrees of Freedom (Poor Road Infrastructure): {dof_infra}")

contingency_table_infra_proportional = pd.crosstab(df['IsPoorRoadInfrastructure'], df['ExtraTimeCongestion_labels'])
contingency_table_infra_proportional = contingency_table_infra_proportional.apply(lambda r: r/r.sum(), axis=1)
contingency_table_infra_proportional = contingency_table_infra_proportional[extra_time_categories]

plt.figure(figsize=(10, 6))
contingency_table_infra_proportional.plot(kind='bar', stacked=True, figsize=(10, 6), cmap='viridis', ax=plt.gca())
plt.title('Proportional Extra Time Due to Congestion by Poor Road Infrastructure Perception')
plt.xlabel('Main Cause: Poor Road Infrastructure')
plt.ylabel('Proportion')
plt.xticks(rotation=0)
plt.legend(title='Extra Time Congestion', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

df['IsTooManyPrivateVehicles'] = df['MainCauseCongestion'].str.contains('Too many private vehicles', case=False, na=False)

contingency_table_vehicles = pd.crosstab(df['IsTooManyPrivateVehicles'], df['ExtraTimeCongestion'])
print("\nContingency Table (IsTooManyPrivateVehicles vs. ExtraTimeCongestion):\n", contingency_table_vehicles)

chi2_vehicles, p_value_vehicles, dof_vehicles, expected_vehicles = chi2_contingency(contingency_table_vehicles)

print(f"\nChi-square Statistic (Too Many Private Vehicles): {chi2_vehicles:.2f}")
print(f"P-value (Too Many Private Vehicles): {p_value_vehicles:.3f}")
print(f"Degrees of Freedom (Too Many Private Vehicles): {dof_vehicles}")

contingency_table_vehicles_proportional = pd.crosstab(df['IsTooManyPrivateVehicles'], df['ExtraTimeCongestion_labels'])
contingency_table_vehicles_proportional = contingency_table_vehicles_proportional.apply(lambda r: r/r.sum(), axis=1)
contingency_table_vehicles_proportional = contingency_table_vehicles_proportional[extra_time_categories]

plt.figure(figsize=(10, 6))
contingency_table_vehicles_proportional.plot(kind='bar', stacked=True, figsize=(10, 6), cmap='viridis', ax=plt.gca())
plt.title('Proportional Extra Time Due to Congestion by Too Many Private Vehicles Perception')
plt.xlabel('Main Cause: Too Many Private Vehicles')
plt.ylabel('Proportion')
plt.xticks(rotation=0)
plt.legend(title='Extra Time Congestion', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

df.drop(columns=['ExtraTimeCongestion_labels', 'IsPoorRoadInfrastructure', 'IsTooManyPrivateVehicles'], inplace=True)

print("Chi-square tests performed and proportional stacked bar charts displayed for congestion causes and extra commute time.")
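
Hypothesis 2 is directional ("higher" extra time), while the Chi-square test only detects that the two distributions differ. One possible complement, not used in the original analysis, is a one-sided Mann-Whitney U test on the ordinal codes. The sketch below uses hypothetical codes for two groups of respondents.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical ExtraTimeCongestion codes (0 = 'Less than 15 minutes' ... 3 = 'More than 1 hour')
cites_cause = np.array([2, 3, 3, 2, 1, 3, 2, 3])  # respondents naming the cause
other_cause = np.array([0, 1, 1, 0, 2, 1, 0, 1])  # all other respondents

# alternative='greater' tests whether the first group tends to report larger codes
stat, p_value = mannwhitneyu(cites_cause, other_cause, alternative="greater")

print(f"U statistic: {stat:.1f}")
print(f"One-sided p-value: {p_value:.4f}")
```

With the real data, the two arrays would come from splitting `df['ExtraTimeCongestion']` on the boolean cause indicator, after excluding any `-1` codes.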

## Test Hypothesis 3 (Age Group and Switch to Public Transport)

Test the hypothesis that younger age groups are more likely to express a willingness to switch to public transport if it were more reliable compared to older age groups. This will involve performing a Chi-square test and generating a proportional stacked bar chart.


In [None]:
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns

switch_transport_categories = ['Definitely not', 'Probably not', 'Not sure', 'Probably yes', 'Yes, definitely']

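# Map codes back to labels; -1 (unmatched or missing) becomes NaN rather than wrapping to 'Yes, definitely'.
df['SwitchToPublicTransport_labels'] = df['SwitchToPublicTransport'].map(dict(enumerate(switch_transport_categories)))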

contingency_table_age_switch = pd.crosstab(df['Age'], df['SwitchToPublicTransport'])
print("Contingency Table (Age vs. SwitchToPublicTransport):\n", contingency_table_age_switch)

chi2_age_switch, p_value_age_switch, dof_age_switch, expected_age_switch = chi2_contingency(contingency_table_age_switch)

print(f"\nChi-square Statistic (Age vs. SwitchToPublicTransport): {chi2_age_switch:.2f}")
print(f"P-value (Age vs. SwitchToPublicTransport): {p_value_age_switch:.3f}")
print(f"Degrees of Freedom (Age vs. SwitchToPublicTransport): {dof_age_switch}")

contingency_table_age_switch_proportional = pd.crosstab(df['Age'], df['SwitchToPublicTransport_labels'])
contingency_table_age_switch_proportional = contingency_table_age_switch_proportional.apply(lambda r: r/r.sum(), axis=1)

contingency_table_age_switch_proportional = contingency_table_age_switch_proportional[switch_transport_categories]

age_labels = ['Below 18', '18-30', '30-45', '>45']
contingency_table_age_switch_proportional.index = contingency_table_age_switch_proportional.index.map(lambda x: age_labels[x] if x >= 0 and x < len(age_labels) else 'Unknown')

plt.figure(figsize=(12, 7))
contingency_table_age_switch_proportional.plot(kind='bar', stacked=True, figsize=(12, 7), cmap='viridis', ax=plt.gca())
plt.title('Proportional Willingness to Switch to Public Transport by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Proportion')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Willingness to Switch', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

df.drop(columns=['SwitchToPublicTransport_labels'], inplace=True)

print("Chi-square test performed and proportional stacked bar chart displayed for Age vs. SwitchToPublicTransport.")
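
Because both 'Age' and 'SwitchToPublicTransport' are ordinal, a rank correlation can indicate the direction of the relationship, which the Chi-square test cannot. The following is a minimal sketch on hypothetical codes, offered as an optional complement rather than the notebook's own method.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical ordinal codes: Age 0-3, SwitchToPublicTransport 0-4
toy = pd.DataFrame({
    "Age": [0, 1, 1, 2, 2, 3, 3, 3],
    "SwitchToPublicTransport": [4, 4, 3, 3, 2, 2, 1, 0],
})

rho, p_value = spearmanr(toy["Age"], toy["SwitchToPublicTransport"])

# A negative rho means willingness decreases as the age code increases
print(f"Spearman rho: {rho:.2f}, p-value: {p_value:.4f}")
```

A negative coefficient on the real data would be consistent with Hypothesis 3; a value near zero would suggest the Chi-square result reflects a non-monotonic pattern instead.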

## Conclusion

### Data Analysis Key Findings

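*   **Commute Mode and Congestion Frequency (Hypothesis 1)**: The Chi-square test indicates a statistically significant association between primary commute mode and the frequency of traffic congestion (p-value = 0.021). At the conventional 0.05 level, the differences in reported congestion frequency across commute modes are unlikely to be due to chance alone. The test does not by itself show which modes report less congestion; that direction has to be read from the proportional stacked bar chart.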
*   **Congestion Causes and Extra Commute Time (Hypothesis 2)**:
    *   A statistically significant association exists between perceiving 'Poor road infrastructure' as a main cause of congestion and the reported 'ExtraTimeCongestion' (p-value < 0.001).
    *   A statistically significant association also exists between perceiving 'Too many private vehicles' as a main cause of congestion and the reported 'ExtraTimeCongestion' (p-value = 0.025).
*   **Age Group and Willingness to Switch to Public Transport (Hypothesis 3)**: There is a statistically significant association between age group and willingness to switch to public transport if it were more reliable (p-value = 0.041). This indicates that willingness differs across age groups; the Chi-square test alone does not establish that younger respondents are the more willing ones, so the direction should be checked against the proportional stacked bar chart.
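
Since four tests were run on the same survey, it can be worth checking how the findings hold up under a multiple-comparison correction. The sketch below applies the Holm step-down procedure to the p-values reported above. The function name `holm_reject` is hypothetical, and `0.001` is used as an upper-bound stand-in for the reported "p < 0.001", which is an assumption because the exact value is not given.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: return a reject/keep flag for each p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p upward
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value with alpha / (m - k)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

reported = {
    "H1: commute mode vs congestion frequency": 0.021,
    "H2: poor road infrastructure vs extra time": 0.001,  # upper bound for "p < 0.001"
    "H2: too many private vehicles vs extra time": 0.025,
    "H3: age vs willingness to switch": 0.041,
}

decisions = holm_reject(list(reported.values()), alpha=0.05)
for name, keep in zip(reported, decisions):
    print(f"{name}: {'significant' if keep else 'not significant'} after Holm correction")
```

Under these assumptions only the 'Poor road infrastructure' result remains significant after correction, so the other three findings are best treated as suggestive rather than confirmed.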

### Insights and Recommendations

*   **Targeted Infrastructure Improvements:** Perceiving 'Poor road infrastructure' as a main cause is significantly associated with the reported extra commute time. If the proportional chart confirms that this group reports longer delays, improving road infrastructure is a plausible way to reduce congestion delays, especially in areas where it is the primary concern.
*   **Promote Public Transport for Younger Demographics:** Willingness to switch to public transport varies significantly with age group. If younger respondents are indeed the more willing group, campaigns and service improvements targeting them could increase public transport adoption and reduce reliance on private vehicles.