# Basics of Statistics

Statistics is the science of collecting, analyzing, and presenting data. It's fundamental to extracting insights from data in machine learning.

### Key Concepts in Statistics
1. **Variables:** Characteristics or attributes that can be measured or observed (e.g., gender, preferred newspaper, blood pressure, mode of transport to work).
    * Example: In a customer dataset, variables might include `age` (numerical), `country` (categorical), `has_made_purchase` (boolean), `total_spent` (numerical), `customer_segment` (categorical).
2. **Data Collection:** The process of gathering information about variables. This can be done through surveys, experiments, or other methods. Data is often organized in tables, where rows represent individual observations (e.g., people) and columns represent variables.
    * Example: Collecting data on website user behavior (clicks, page views, time spent on site) to understand engagement.
3. **Descriptive vs. Inferential Statistics:** 
    * **Descriptive statistics** summarizes and describes the collected data _itself_. It doesn't make generalizations beyond the data at hand.
    * **Inferential statistics** uses a _sample_ of data to make inferences or draw conclusions about a larger _population_. This is crucial for machine learning, where we want our models to generalize to new data.
    * Example: Calculating the mean of your training data is descriptive statistics, because it summarizes only the data you have. Using that sample to draw conclusions about new, unseen data is inferential statistics.
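
The concepts above can be sketched with Python's standard library. This is a minimal illustration, not a recommended workflow: the `customers` rows are invented, and the "population" is simulated with an assumed mean of 40 and spread of 12 so that the sample estimate can be compared with the true value.

```python
import random
import statistics

# Hypothetical customer table: each dict is one observation (row),
# and each key is a variable (column).
customers = [
    {"age": 34, "country": "DE", "has_made_purchase": True, "total_spent": 120.5},
    {"age": 27, "country": "US", "has_made_purchase": False, "total_spent": 0.0},
    {"age": 45, "country": "US", "has_made_purchase": True, "total_spent": 310.0},
    {"age": 52, "country": "FR", "has_made_purchase": True, "total_spent": 89.9},
]

# Descriptive: summarize exactly the rows we collected.
ages = [row["age"] for row in customers]
mean_age = statistics.mean(ages)
print("Mean age in the collected data:", mean_age)

# Inferential: use a sample to estimate a property of a larger population.
# The population is simulated here (assumed mean 40, spread 12) so the
# estimate can be checked against the true value.
rng = random.Random(0)  # fixed seed for reproducibility
population = [rng.gauss(40, 12) for _ in range(10_000)]
sample = rng.sample(population, 100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print("Population mean:", round(population_mean, 2))
print("Estimate from a sample of 100:", round(sample_mean, 2))
```

In practice the population mean is unknown; the sample mean is the best available estimate of it, and inferential statistics quantifies how far off that estimate might be.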

### Descriptive Statistics: Summarizing Data
Descriptive statistics provide a concise overview of a dataset. The main components are:
1. **Measures of Central Tendency:** These describe the "typical" or "central" value in a dataset. 
    * **Mean:** The average value (sum of all values divided by the number of values). Sensitive to outliers (extreme values).
        * Example: Average test score, average customer spending, average click-through rate.
    * **Median:** The middle value when the data is sorted. Resistant to outliers.
        * Example: Median house price in a city (less affected by a few extremely expensive houses than the mean would be). Median income.
    * **Mode:** The most frequent value.
        * Example: The most common product purchased on an e-commerce site, the most frequent user action on a website.
2. **Measures of Dispersion:** These describe how spread out the data is.
    * **Standard Deviation (`σ`):** The square root of the average squared deviation from the mean; roughly, the typical distance of data points from the mean. A low standard deviation means data points are clustered close to the mean; a high standard deviation means they are more spread out.
        * Example: If the standard deviation of customer ages is small, it means most customers are close to the average age. If it's large, the customer base is more diverse in age.
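    * **Variance:** The average of the squared deviations from the mean, which equals the square of the standard deviation.
        * Example: In finance, the variance of an investment's returns is a common measure of its risk. Volatility is usually quoted as the corresponding standard deviation.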
    * **Range:** The difference between the maximum and minimum values. 
        * Example: Range of delivery times.
    * **Interquartile Range (IQR):** The range of the middle 50% of the data (between the 25th and 75th percentiles). Resistant to outliers.
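        * Example: Used in box plots to show the spread of the central part of the data, and to identify outliers (a common rule flags points more than 1.5 × IQR below the 25th percentile or above the 75th percentile).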
3. **Frequency Tables:** Show how often each distinct value of a variable occurs. Useful for categorical data.
    * Example: A table showing the number of users from each country visiting a website.
4. **Contingency Tables (Cross-Tabs):** Show the relationship between two categorical variables. Each cell shows the count of observations that fall into a specific combination of categories.
    * Example: A table showing the number of customers who purchased a particular product _and_ belong to a specific customer segment. This helps understand if certain segments are more likely to buy certain products.
5. **Charts and Graphs:** Visual representations of data. See the [Data Visualization with Python](../../02_application/data_visualization_with_python/01_introduction_to_data_visualization_tools.ipynb) section for hands-on examples.
    * **Bar Charts:** Good for displaying frequencies of categorical variables. Example: A bar chart showing the number of users in each subscription plan (free, basic, premium).
    * **Pie Charts:** Show proportions of categories. Example: Market share of different companies.
    * **Histograms:** Show the distribution of a numerical variable. Example: A histogram of website session durations, showing how many sessions lasted 0-1 minute, 1-2 minutes, 2-3 minutes, etc.
    * **Box Plots:** Display the median, quartiles, and outliers of a numerical variable. Example: Compare the distributions of purchase amounts across different customer groups.
    * **Violin Plots:** Similar to box plots but also show the density of the data. Example: Compare the distribution of ages for users who clicked on different advertisements.
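
The numerical summaries and tables described above can be computed with Python's standard library. The following is a small sketch on invented data (the delivery times, countries, and purchases are all hypothetical). It uses the population versions of standard deviation and variance (`pstdev`, `pvariance`); for a sample drawn from a larger population, `stdev` and `variance` would be the usual choice.

```python
import statistics
from collections import Counter

# Hypothetical delivery times in days; 30 is a deliberate outlier.
delivery_days = [2, 3, 3, 4, 5, 5, 5, 6, 30]

# Measures of central tendency
mean_days = statistics.mean(delivery_days)      # pulled upward by the outlier
median_days = statistics.median(delivery_days)  # middle value of the sorted data
mode_days = statistics.mode(delivery_days)      # most frequent value

# Measures of dispersion
std_days = statistics.pstdev(delivery_days)     # population standard deviation
var_days = statistics.pvariance(delivery_days)  # square of the standard deviation
range_days = max(delivery_days) - min(delivery_days)

# quantiles(n=4) returns the three quartile cut points (25th, 50th, 75th).
q1, q2, q3 = statistics.quantiles(delivery_days, n=4, method="inclusive")
iqr_days = q3 - q1

print("mean:", mean_days, "median:", median_days, "mode:", mode_days)
print("std:", round(std_days, 2), "variance:", round(var_days, 2))
print("range:", range_days, "IQR:", iqr_days)

# Frequency table: how often each category occurs.
countries = ["DE", "US", "US", "FR", "DE", "US"]
country_counts = Counter(countries)
print(country_counts.most_common())

# Contingency table: counts for each (segment, product) combination.
purchases = [
    ("new", "book"),
    ("loyal", "book"),
    ("loyal", "laptop"),
    ("new", "book"),
    ("loyal", "laptop"),
]
crosstab = Counter(purchases)
for (segment, product), count in sorted(crosstab.items()):
    print(segment, product, count)
```

Note how the single outlier raises the mean and the range well above the median and the IQR, which is why the median and IQR are described as resistant to outliers.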

### Inferential Statistics: Making Inferences About Populations
