# Statistical Analysis in Python: EDA, Visualization, and Inference

## Introduction

In this notebook, we explore statistical analysis in Python through three components: **Exploratory Data Analysis (EDA)**, **Data Visualization**, and **Statistical Inference**. We use widely adopted data analysis libraries: **pandas** for data manipulation, **matplotlib** and **seaborn** for static visualization, **plotly** for interactive graphics, **scipy** for statistical testing, and **statsmodels** for regression analysis.

## Learning Outcomes

### Exploratory Data Analysis (EDA)
- Load, inspect, and clean datasets; create derived variables and demographic indicators
- Perform comprehensive analysis (univariate, bivariate, multivariate) and handle outliers

### Data Visualization
- Master matplotlib, seaborn, and plotly for statistical and interactive visualizations
- Apply best practices to create publication-ready, multi-panel graphics

### Statistical Inference
- Formulate and test hypotheses using scipy and statsmodels; perform common statistical tests
- Build multiple linear regression models, calculate confidence intervals, and check assumptions


## Dataset Description
This dataset combines two sources: age-group-specific population density data from Meta and building footprints from Google's Open Buildings dataset. To aggregate both sources to cell level, I used GIS packages in Python. You will learn about this processing in the next session, when we cover Module 3 (Spatial Data Processing).

**Dataset Overview:**
- **Original population density data:** [Gridded Population Density by Age Group](https://drive.google.com/file/d/10ReitvO0LWFT6CnuJEHZzJZGG3WdL75j/view?usp=share_link)
- **Building Footprints:** [Google Open Buildings](https://sites.research.google/gr/open-buildings/#open-buildings-download)
- **Download link:** [Download data file from here](https://drive.google.com/file/d/1FWEFGdN-xDuFH1jmt0hr4F8Xc3Y5XzvB/view?usp=sharing)


### Variable Descriptions

#### Geographic Identifiers
| Variable | Type | Description |
|----------|------|-------------|
| `cell_id` | String | Unique identifier for each administrative cell |
| `province_name` | String | Province name (5 provinces: Kigali, Eastern, Western, Northern, Southern) |
| `district_name` | String | District name within province (30 districts total) |
| `sector_name` | String | Sector name within district (administrative subdivision) |
| `cell_name` | String | Cell name (smallest administrative unit) |

#### Demographic Variables (2020 Population Estimates)
| Variable | Description |
|----------|-------------|
| `general_2020` | Total population in the cell |
| `elderly_60_plus_2020` | Population aged 60 years and above |
| `children_under_five_2020` | Population under 5 years of age |
| `youth_15_24_2020` | Population aged 15-24 years |
| `men_2020` | Male population |
| `women_2020` | Female population |

#### Infrastructure Variable
| Variable | Type | Description |
|----------|------|-------------|
| `building_count` | Float | Number of buildings/structures in the cell |
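
The variable tables above can be turned into a quick schema check plus a couple of derived indicators. The sketch below is illustrative only: the two rows in `sample` are made-up placeholder values, not records from the dataset, and the helper name `add_indicators` and the derived column names (`elderly_share`, `people_per_building`) are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd

# Column names copied from the variable tables above.
ID_COLS = ["cell_id", "province_name", "district_name", "sector_name", "cell_name"]
POP_COLS = [
    "general_2020", "elderly_60_plus_2020", "children_under_five_2020",
    "youth_15_24_2020", "men_2020", "women_2020",
]
EXPECTED_COLS = ID_COLS + POP_COLS + ["building_count"]


def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative derived columns (hypothetical names)."""
    missing = [c for c in EXPECTED_COLS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    out = df.copy()
    # Mask non-positive denominators with NaN so the ratios never divide by zero.
    total = out["general_2020"].where(out["general_2020"] > 0)
    buildings = out["building_count"].where(out["building_count"] > 0)
    out["elderly_share"] = out["elderly_60_plus_2020"] / total
    out["people_per_building"] = out["general_2020"] / buildings
    return out


# Placeholder rows for demonstration only (NOT real data from the dataset).
sample = pd.DataFrame({
    "cell_id": ["A", "B"],
    "province_name": ["P1", "P2"],
    "district_name": ["D1", "D2"],
    "sector_name": ["S1", "S2"],
    "cell_name": ["C1", "C2"],
    "general_2020": [1000.0, 500.0],
    "elderly_60_plus_2020": [100.0, 25.0],
    "children_under_five_2020": [150.0, 80.0],
    "youth_15_24_2020": [200.0, 90.0],
    "men_2020": [480.0, 240.0],
    "women_2020": [520.0, 260.0],
    "building_count": [250.0, 0.0],
})

result = add_indicators(sample)
print(result[["cell_id", "elderly_share", "people_per_building"]])
```

Masking zero denominators keeps cells with no recorded buildings in the table as missing values rather than infinities, which is usually easier to handle in later summaries.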

In [None]:
from pathlib import Path

# Data folder sits two levels above the current working directory
DIR_DATA = Path.cwd().parents[1] / "data"
FILE_CELL_POP = DIR_DATA / "population" / "rwa-cell-pop.csv"
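
Once the path is defined, the file can be read with pandas. The following is a minimal sketch, assuming the CSV has a header row whose names match the variable tables above. `load_cell_pop` is a hypothetical helper, and reading `cell_id` as a string is an assumption based on its documented type.

```python
from pathlib import Path

import pandas as pd


def load_cell_pop(path: Path) -> pd.DataFrame:
    """Read the cell-level population CSV (hypothetical helper for this notebook)."""
    path = Path(path)
    if not path.exists():
        # Fail early with a readable message instead of a long pandas traceback.
        raise FileNotFoundError(f"Data file not found: {path}")
    # Assumption: keep cell_id as text so any leading zeros are preserved.
    return pd.read_csv(path, dtype={"cell_id": str})
```

A typical call would be `df = load_cell_pop(FILE_CELL_POP)`, followed by `df.info()` and `df.head()` to inspect column types and the first rows.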