# Descriptive Statistics Tutorial

This notebook demonstrates key concepts from descriptive statistics including:
- Data storage and manipulation
- Descriptive statistics (mean, median, mode)
- Data visualizations (pie charts, bar charts, line charts, scatter plots)
- Trend analysis and correlation

## 1. Import Required Libraries

First, we'll import the necessary Python libraries for data analysis and visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

<details>
<summary>💡 Summary</summary>

We imported pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for visualizations, and scipy for statistical functions. These are the core libraries for data analysis in Python.
</details>

## 2. Data Storage Examples

### Creating and Storing Data in Different Formats

In [None]:
# Creating sample data about tennis players in England
tennis_data = {
    'Year': [2016, 2017, 2018, 2019, 2020, 2021],
    'NumberOfPlayers': [889300, 865100, 840200, 754900, 739900, 635700]
}

# Convert to DataFrame
df_tennis = pd.DataFrame(tennis_data)

# Display the data
print("Tennis Players in England:")
print(df_tennis)

# Save to CSV file
df_tennis.to_csv('tennis_players_england.csv', index=False)
print("\nData saved to tennis_players_england.csv")

<details>
<summary>💡 Summary</summary>

We created a dataset using a Python dictionary, converted it to a pandas DataFrame for easy manipulation, and saved it as a CSV file. This demonstrates the fundamental way to store and organize data in Python for analysis.
</details>
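
Saving is only half of storage; the data also has to come back unchanged. The sketch below is a minimal round trip using only the standard library's `csv` module, writing to an in-memory buffer rather than a real file (an assumption made here to avoid side effects). With pandas, `pd.read_csv` would play the reading role.

```python
import csv
import io

# Same figures as the tennis example in this section.
rows = [
    {"Year": 2016, "NumberOfPlayers": 889300},
    {"Year": 2017, "NumberOfPlayers": 865100},
    {"Year": 2018, "NumberOfPlayers": 840200},
    {"Year": 2019, "NumberOfPlayers": 754900},
    {"Year": 2020, "NumberOfPlayers": 739900},
    {"Year": 2021, "NumberOfPlayers": 635700},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Year", "NumberOfPlayers"])
writer.writeheader()
writer.writerows(rows)

# Read it back; csv returns every field as a string, so convert explicitly.
buffer.seek(0)
restored = [
    {"Year": int(r["Year"]), "NumberOfPlayers": int(r["NumberOfPlayers"])}
    for r in csv.DictReader(buffer)
]

print("Round trip preserved the data:", restored == rows)
```

The explicit `int(...)` conversion is the point to notice: CSV stores text only, so column types have to be restored when reading.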

## 3. Descriptive Statistics: Mean, Median, and Mode

### Example: Website Load Times

In [None]:
# Website load times from 7 trials (in seconds)
load_times = [1.2, 1.5, 2.0, 1.2, 3.2, 1.9, 80]

# Calculate descriptive statistics
mean_time = np.mean(load_times)
median_time = np.median(load_times)
mode_result = stats.mode(load_times, keepdims=True)
mode_time = mode_result.mode[0]

print("Website Load Time Analysis:")
print(f"Data: {load_times}")
print(f"\nMean (Average): {mean_time:.2f} seconds")
print(f"Median: {median_time:.2f} seconds")
print(f"Mode: {mode_time:.2f} seconds")
print(f"\nMinimum: {min(load_times):.2f} seconds")
print(f"Maximum: {max(load_times):.2f} seconds")

<details>
<summary>💡 Summary</summary>

The mean (13 seconds) is heavily skewed by the outlier value of 80 seconds. The median (1.9 seconds) provides a better representation of the typical load time since it's less sensitive to outliers. The mode (1.2 seconds) shows the most frequently occurring value.
</details>
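
To see where these three numbers come from, the sketch below recomputes them step by step with plain Python instead of `numpy` and `scipy`. It is an illustration of the definitions, not a replacement for the library calls.

```python
import statistics

load_times = [1.2, 1.5, 2.0, 1.2, 3.2, 1.9, 80]

# Mean: sum of all values divided by how many there are.
mean_time = sum(load_times) / len(load_times)

# Median: middle value after sorting (average of the two middle values if the count is even).
ordered = sorted(load_times)
middle = len(ordered) // 2
if len(ordered) % 2 == 1:
    median_time = ordered[middle]
else:
    median_time = (ordered[middle - 1] + ordered[middle]) / 2

# Mode: the most frequently occurring value.
mode_time = statistics.mode(load_times)

print(f"Mean: {mean_time:.2f}  Median: {median_time}  Mode: {mode_time}")
```

Sorting puts 80 at the very end of the list, which is why it has no effect on the median while it dominates the sum used for the mean.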

### Handling Outliers: Data Cleaning

In [None]:
# Remove the outlier (80 seconds)
load_times_cleaned = [1.2, 1.5, 2.0, 1.2, 3.2, 1.9]

# Recalculate statistics
mean_cleaned = np.mean(load_times_cleaned)
median_cleaned = np.median(load_times_cleaned)
mode_cleaned = stats.mode(load_times_cleaned, keepdims=True).mode[0]

print("After Removing Outlier:")
print(f"Data: {load_times_cleaned}")
print(f"\nMean (Average): {mean_cleaned:.2f} seconds")
print(f"Median: {median_cleaned:.2f} seconds")
print(f"Mode: {mode_cleaned:.2f} seconds")

# Visualize the difference
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.boxplot([load_times], labels=['With Outlier'])
ax1.set_ylabel('Load Time (seconds)')
ax1.set_title('Load Times - With Outlier')

ax2.boxplot([load_times_cleaned], labels=['Without Outlier'])
ax2.set_ylabel('Load Time (seconds)')
ax2.set_title('Load Times - After Cleaning')

plt.tight_layout()
plt.show()

<details>
<summary>💡 Summary</summary>

After removing the outlier, the mean dropped from 13 seconds to 1.83 seconds, now better representing the typical load time. The box plots visually demonstrate how outliers can distort data analysis. Data cleaning is essential for accurate statistical analysis.
</details>
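
The cell above removes the 80-second value by hand. One common convention for flagging outliers programmatically is the 1.5 × IQR rule, the same default that box plot whiskers use. The sketch below applies it with the standard library; the choice of rule and of the `inclusive` quartile method are assumptions made for illustration, and other thresholds are equally defensible.

```python
import statistics

load_times = [1.2, 1.5, 2.0, 1.2, 3.2, 1.9, 80]

# Quartiles; "inclusive" treats the data as the whole population of interest.
q1, _, q3 = statistics.quantiles(load_times, n=4, method="inclusive")
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [t for t in load_times if t < lower_fence or t > upper_fence]
cleaned = [t for t in load_times if lower_fence <= t <= upper_fence]

print(f"Fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print(f"Outliers: {outliers}")
print(f"Cleaned:  {cleaned}")
```

A rule like this only flags candidates. Whether a flagged value is a measurement error or a genuine slow page load is still a judgment call.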

### Comparing Different Datasets

In [None]:
# Two groups of developers' coding experience (in years)
respondents_a = [5, 3, 7, 10, 5, 6, 8, 12, 5, 15, 20, 7, 9, 5, 11, 8, 6, 14, 25, 9]
respondents_b = [5, 3, 7, 200, 5, 6, 8, 12, 5, 15, 20, 7, 9, 5, 150, 8, 6, 14, 25, 9]

# Calculate statistics for both groups
print("Respondents A (No Outliers):")
print(f"Mean: {np.mean(respondents_a):.2f} years")
print(f"Median: {np.median(respondents_a):.2f} years")
print(f"Mode: {stats.mode(respondents_a, keepdims=True).mode[0]:.0f} years")

print("\nRespondents B (With Outliers):")
print(f"Mean: {np.mean(respondents_b):.2f} years")
print(f"Median: {np.median(respondents_b):.2f} years")
print(f"Mode: {stats.mode(respondents_b, keepdims=True).mode[0]:.0f} years")

# Create comparison visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.hist(respondents_a, bins=10, edgecolor='black', alpha=0.7)
ax1.axvline(np.mean(respondents_a), color='red', linestyle='--', label=f'Mean: {np.mean(respondents_a):.2f}')
ax1.axvline(np.median(respondents_a), color='green', linestyle='--', label=f'Median: {np.median(respondents_a):.2f}')
ax1.set_xlabel('Years of Experience')
ax1.set_ylabel('Frequency')
ax1.set_title('Respondents A - No Outliers')
ax1.legend()

ax2.hist(respondents_b, bins=10, edgecolor='black', alpha=0.7)
ax2.axvline(np.mean(respondents_b), color='red', linestyle='--', label=f'Mean: {np.mean(respondents_b):.2f}')
ax2.axvline(np.median(respondents_b), color='green', linestyle='--', label=f'Median: {np.median(respondents_b):.2f}')
ax2.set_xlabel('Years of Experience')
ax2.set_ylabel('Frequency')
ax2.set_title('Respondents B - With Outliers')
ax2.legend()

plt.tight_layout()
plt.show()

<details>
<summary>💡 Summary</summary>

For Respondents A without outliers, the mean (9.50 years) is a reasonable summary and sits close to the median (8 years). For Respondents B with outliers (200 and 150 years of coding), the median (8 years) is more representative than the mean (25.95 years). The histograms show how outliers pull the mean away from the median.
</details>
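
A quick way to quantify this effect is to measure how far each statistic moves when the two outliers are introduced. The sketch below does that with the standard library; the `shift` dictionary is a hypothetical helper structure introduced only for this illustration.

```python
import statistics

respondents_a = [5, 3, 7, 10, 5, 6, 8, 12, 5, 15, 20, 7, 9, 5, 11, 8, 6, 14, 25, 9]
respondents_b = [5, 3, 7, 200, 5, 6, 8, 12, 5, 15, 20, 7, 9, 5, 150, 8, 6, 14, 25, 9]

mean_a, mean_b = statistics.mean(respondents_a), statistics.mean(respondents_b)
median_a, median_b = statistics.median(respondents_a), statistics.median(respondents_b)

# How much each statistic changes when two values are replaced by outliers.
shift = {
    "mean": mean_b - mean_a,
    "median": median_b - median_a,
}

print(f"Mean:   {mean_a:.2f} -> {mean_b:.2f} (shift {shift['mean']:+.2f})")
print(f"Median: {median_a:.2f} -> {median_b:.2f} (shift {shift['median']:+.2f})")
```

Only two of the twenty values differ between the groups, yet the mean moves by more than 16 years while the median does not move at all.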

## 4. Data Visualization: Pie Charts

### Product Sales Revenue Distribution

In [None]:
# Product sales data
products = ['Product A', 'Product B', 'Product C', 'Product D']
sales_percentages = [60, 20, 12, 8]
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']

# Create pie chart
plt.figure(figsize=(10, 8))
plt.pie(sales_percentages, labels=products, autopct='%1.1f%%', 
        colors=colors, startangle=90, explode=(0.1, 0, 0, 0))
plt.title('Product Sales Revenue Distribution', fontsize=16, fontweight='bold')
plt.axis('equal')
plt.show()

print("Sales Analysis:")
for product, percentage in zip(products, sales_percentages):
    print(f"{product}: {percentage}%")

<details>
<summary>💡 Summary</summary>

The pie chart clearly shows that Product A dominates sales with 60% of total revenue, representing more than half of the company's sales. Product A is exploded (separated) from the pie to emphasize its importance. Pie charts are most effective when showing part-to-whole relationships with a small number of categories.
</details>

### Operating System Preferences

In [None]:
# Operating system preferences data
os_data = {
    'Operating System': ['Windows', 'MacOS', 'Linux-based', 'BSD'],
    'Count': [5, 3, 10, 2]
}

df_os = pd.DataFrame(os_data)

# Create pie chart
plt.figure(figsize=(10, 8))
colors_os = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']
plt.pie(df_os['Count'], labels=df_os['Operating System'], autopct='%1.1f%%',
        colors=colors_os, startangle=140)
plt.title('Developer Operating System Preferences', fontsize=16, fontweight='bold')
plt.axis('equal')
plt.show()

print("\nOperating System Distribution:")
print(df_os.to_string(index=False))

<details>
<summary>💡 Summary</summary>

Linux-based operating systems are the most popular among respondents with 50% (10 out of 20 responses), followed by Windows at 25%. MacOS and BSD have the smallest slices at 15% and 10% respectively. This pie chart effectively communicates the preference distribution across four operating systems.
</details>
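
The percentages that `autopct` prints on each slice are just each count divided by the total. A minimal sketch of that arithmetic, using the same counts:

```python
os_counts = {"Windows": 5, "MacOS": 3, "Linux-based": 10, "BSD": 2}

total = sum(os_counts.values())

# Share of each category as a percentage of all responses.
shares = {name: 100 * count / total for name, count in os_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")

print(f"Total: {sum(shares.values()):.1f}%")
```

Checking that the shares add up to 100% is a useful sanity test before drawing any pie chart.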

## 5. Data Visualization: Bar Charts

### Programming Language Popularity

In [None]:
# Programming language popularity data
languages = ['JavaScript', 'Python', 'Java', 'C#', 'C++', 'PHP', 'TypeScript', 'C']
popularity = [18, 15, 12, 8, 7, 6, 5, 4]

# Create horizontal bar chart
plt.figure(figsize=(12, 8))
bars = plt.barh(languages, popularity, color='steelblue', edgecolor='black')

# Add value labels on bars
for i, (bar, value) in enumerate(zip(bars, popularity)):
    plt.text(value + 0.3, i, f'{value}', va='center', fontweight='bold')

plt.xlabel('Number of Respondents', fontsize=12, fontweight='bold')
plt.ylabel('Programming Language', fontsize=12, fontweight='bold')
plt.title('Programming Language Experience', fontsize=16, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nProgramming Language Rankings:")
for i, (lang, count) in enumerate(zip(languages, popularity), 1):
    print(f"{i}. {lang}: {count} respondents")

<details>
<summary>💡 Summary</summary>

The horizontal bar chart makes it easy to compare programming languages. JavaScript is the most popular with 18 respondents, followed by Python (15) and Java (12). Bar charts excel at comparing values across categories: the length differences are easier to perceive than pie chart slice areas. Horizontal bars work well when category names are long.
</details>

### Web Framework Comparison

In [None]:
# Web framework popularity (based on the 2021 Stack Overflow survey)
frameworks = ['React.js', 'jQuery', 'Angular', 'Vue.js', 'Express', 'ASP.NET', 'Django']
percentages = [40.14, 34.43, 26.23, 18.97, 23.82, 18.1, 14.99]

# Create vertical bar chart
plt.figure(figsize=(12, 8))
bars = plt.bar(frameworks, percentages, color='coral', edgecolor='black', alpha=0.8)

# Customize the chart
plt.xlabel('Web Framework', fontsize=12, fontweight='bold')
plt.ylabel('Percentage (%)', fontsize=12, fontweight='bold')
plt.title('Most Popular Web Frameworks (2021 Stack Overflow Survey)', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)

# Add value labels on top of bars
for bar, value in zip(bars, percentages):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'{value}%', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

<details>
<summary>💡 Summary</summary>

React.js leads as the most popular web framework at 40.14%, followed by jQuery at 34.43%. The vertical column chart format is ideal for this data. We can clearly see that Angular (26.23%) is more popular than Vue.js (18.97%), and the visual comparison is much easier than with a pie chart.
</details>
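
In the data as entered, Vue.js (18.97%) is listed before Express (23.82%), so the bars are not in rank order. Sorting the categories before plotting usually makes a bar chart easier to read. A minimal sketch, assuming descending order is what we want:

```python
frameworks = ['React.js', 'jQuery', 'Angular', 'Vue.js', 'Express', 'ASP.NET', 'Django']
percentages = [40.14, 34.43, 26.23, 18.97, 23.82, 18.1, 14.99]

# Pair each framework with its percentage, then sort by percentage, largest first.
ranked = sorted(zip(frameworks, percentages), key=lambda pair: pair[1], reverse=True)

# Unzip back into two parallel lists that can be passed to a plotting call.
sorted_frameworks = [name for name, _ in ranked]
sorted_percentages = [value for _, value in ranked]

for rank, (name, value) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {value}%")
```

The two sorted lists can then stand in for `frameworks` and `percentages` in the bar chart call.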

## 6. Identifying Trends in Data: Line Charts

### Mobile and Telephone Subscriptions Over Time

In [None]:
# UK telecommunications data (2000-2019)
years = list(range(2000, 2020))
mobile_subs = [74, 78, 80, 85, 90, 95, 98, 100, 102, 102, 102, 103, 103, 103, 102, 102, 101, 101, 100, 100]
fixed_telephone = [60, 59.5, 59, 58.5, 58, 57.5, 57, 56.5, 56, 55.5, 55, 54.5, 54, 53.5, 53, 52.5, 52, 51.5, 51, 50.5]
broadband = [0, 0.5, 1, 2, 4, 8, 15, 25, 30, 35, 37, 38, 38.5, 39, 39.2, 39.4, 39.6, 39.8, 40, 40.2]

# Create line chart
plt.figure(figsize=(14, 8))
plt.plot(years, mobile_subs, marker='o', linewidth=2, label='Mobile Cellular', color='#3498db')
plt.plot(years, fixed_telephone, marker='s', linewidth=2, label='Fixed Telephone', color='#e74c3c')
plt.plot(years, broadband, marker='^', linewidth=2, label='Fixed Broadband', color='#2ecc71')

plt.xlabel('Year', fontsize=12, fontweight='bold')
plt.ylabel('Subscriptions (per 100 people)', fontsize=12, fontweight='bold')
plt.title('UK Telecommunications Subscriptions (2000-2019)', fontsize=16, fontweight='bold')
plt.legend(fontsize=11, loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Trend Analysis:")
print(f"Mobile Cellular: Increased from {mobile_subs[0]} (2000) to {mobile_subs[7]} (2007), then stayed roughly level (100-103)")
print(f"Fixed Telephone: Steady decrease from {fixed_telephone[0]} (2000) to {fixed_telephone[-1]} (2019)")
print(f"Fixed Broadband: Rapid increase from {broadband[0]} (2000) to {broadband[-1]} (2019)")

<details>
<summary>💡 Summary</summary>

Line charts effectively show trends over time. Mobile subscriptions show an **upward trend** (2000-2007) followed by a **constant trend**. Fixed telephone shows a **downward trend** throughout the period. Fixed broadband displays a **significant upward trend**, especially between 2000-2010. The different colored lines with markers make it easy to track each subscription type.
</details>
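
The trend labels above were read off the chart by eye. As a rough cross-check, the sketch below labels a series by its net change from first to last value. The helper `describe_trend` and its `tolerance` threshold are hypothetical, introduced only for this illustration.

```python
def describe_trend(values, tolerance=1.0):
    """Label a series by its net change; changes within `tolerance` count as constant."""
    net_change = values[-1] - values[0]
    if abs(net_change) <= tolerance:
        return "constant"
    return "upward" if net_change > 0 else "downward"


mobile_subs = [74, 78, 80, 85, 90, 95, 98, 100, 102, 102, 102, 103, 103, 103, 102, 102, 101, 101, 100, 100]
fixed_telephone = [60, 59.5, 59, 58.5, 58, 57.5, 57, 56.5, 56, 55.5, 55, 54.5, 54, 53.5, 53, 52.5, 52, 51.5, 51, 50.5]
broadband = [0, 0.5, 1, 2, 4, 8, 15, 25, 30, 35, 37, 38, 38.5, 39, 39.2, 39.4, 39.6, 39.8, 40, 40.2]

split = 7  # index of 2007, since the series starts in 2000

trends = {
    "Mobile 2000-2007": describe_trend(mobile_subs[: split + 1]),
    "Mobile 2007-2019": describe_trend(mobile_subs[split:]),
    "Fixed telephone": describe_trend(fixed_telephone),
    "Fixed broadband": describe_trend(broadband),
}

for name, label in trends.items():
    print(f"{name}: {label}")
```

Net change ignores what happens between the endpoints (mobile subscriptions rise to 103 before returning to 100), so it complements the chart rather than replacing it.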

### Programming Language Trends (2013-2021)

In [None]:
# Programming language usage trends
years_prog = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
javascript = [62, 63, 64, 65, 69, 71, 69, 68, 65]
java = [38, 39, 39, 40, 40, 40, 39, 38, 36]
python = [18, 20, 22, 25, 28, 32, 38, 42, 48]

# Create line chart
plt.figure(figsize=(14, 8))
plt.plot(years_prog, javascript, marker='o', linewidth=3, label='JavaScript', color='#f7df1e', markersize=8)
plt.plot(years_prog, java, marker='s', linewidth=3, label='Java', color='#007396', markersize=8)
plt.plot(years_prog, python, marker='^', linewidth=3, label='Python', color='#3776ab', markersize=8)

plt.xlabel('Year', fontsize=12, fontweight='bold')
plt.ylabel('Percentage of Developers (%)', fontsize=12, fontweight='bold')
plt.title('Programming Language Popularity Trends (2013-2021)', fontsize=16, fontweight='bold')
plt.legend(fontsize=12, loc='best')
plt.grid(True, alpha=0.3)
plt.xticks(years_prog)
plt.tight_layout()
plt.show()

# Calculate net change over the period (in percentage points)
js_change = javascript[-1] - javascript[0]
java_change = java[-1] - java[0]
python_change = python[-1] - python[0]

print("\nTrend Analysis (2013-2021):")
print(f"JavaScript: {js_change:+.0f} percentage points (peaked in 2018, then declining)")
print(f"Java: {java_change:+.0f} percentage points (constant 2013-2018, declining since)")
print(f"Python: {python_change:+.0f} percentage points (strong upward trend throughout)")

<details>
<summary>💡 Summary</summary>

Python shows a consistent **strong upward trend** throughout 2013-2021, growing by 30 percentage points. JavaScript was relatively constant (2013-2016), increased significantly (2016-2018), then slowly decreased. Java remained constant (2013-2018) then shows a slow downward trend. Based on these trends, Python is likely to continue gaining popularity, potentially surpassing JavaScript within the next decade.
</details>
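
To put a rough number on "within the next decade", the sketch below turns each net change into an average change per year and extends both lines forward. This is a deliberately naive linear extrapolation under the assumption that the 2013-2021 averages simply continue; it is an illustration of the arithmetic, not a forecast.

```python
years_prog = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
javascript = [62, 63, 64, 65, 69, 71, 69, 68, 65]
python = [18, 20, 22, 25, 28, 32, 38, 42, 48]

span = years_prog[-1] - years_prog[0]  # 8 years

# Average change per year, in percentage points.
js_rate = (javascript[-1] - javascript[0]) / span      # 3 / 8
python_rate = (python[-1] - python[0]) / span          # 30 / 8

# Gap still to close in 2021, and how fast it closes if both averages persist.
gap = javascript[-1] - python[-1]
years_to_cross = gap / (python_rate - js_rate)

print(f"JavaScript: {js_rate:+.3f} points/year")
print(f"Python:     {python_rate:+.3f} points/year")
print(f"Gap in 2021: {gap} points, closed in about {years_to_cross:.1f} years at these rates")
```

Because JavaScript has been declining since 2018, using only the most recent years would give a shorter estimate; the result depends heavily on which window is averaged.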

## 7. Identifying Patterns: Scatter Charts and Correlation

### Developer Salary vs Experience

In [None]:
# Developer types data (experience vs salary)
developer_data = {
    'Type': ['Junior', 'Mid', 'Senior', 'Manager', 'SRE', 'DevOps', 'Data', 
             'Data Engineering', 'Scientist', 'Mobile', 'Frontend', 'Backend'],
    'Experience': [3.5, 6.8, 15.3, 13.5, 11.2, 10.8, 8.6, 9.8, 10.2, 9.4, 7.2, 8.9],
    'Salary': [42, 58, 95, 95, 85, 71, 65, 78, 75, 44, 52, 62]
}

df_dev = pd.DataFrame(developer_data)

# Create scatter plot with trend line
plt.figure(figsize=(14, 9))
plt.scatter(df_dev['Experience'], df_dev['Salary'], s=200, alpha=0.6, 
            c=df_dev['Salary'], cmap='viridis', edgecolors='black', linewidth=1.5)

# Add labels for each point
for i, row in df_dev.iterrows():
    plt.annotate(row['Type'], (row['Experience'], row['Salary']), 
                xytext=(5, 5), textcoords='offset points', fontsize=9, fontweight='bold')

# Add trend line
z = np.polyfit(df_dev['Experience'], df_dev['Salary'], 1)
p = np.poly1d(z)
plt.plot(df_dev['Experience'], p(df_dev['Experience']), "r--", linewidth=2, 
         alpha=0.8, label=f'Trend Line: y={z[0]:.2f}x+{z[1]:.2f}')

# Calculate correlation
correlation = df_dev['Experience'].corr(df_dev['Salary'])

plt.xlabel('Average Years of Professional Experience', fontsize=12, fontweight='bold')
plt.ylabel('Median Yearly Salary (USD, thousands)', fontsize=12, fontweight='bold')
plt.title(f'Developer Salary vs Experience\nCorrelation: {correlation:.3f}', 
          fontsize=16, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.colorbar(label='Salary (thousands USD)')
plt.tight_layout()
plt.show()

print(f"Correlation coefficient: {correlation:.3f}")
print("\nInterpretation: The correlation is strong and positive.")
print("Experience tracks salary closely, but developer type still matters (Mobile, for example, sits well below the trend line).")

<details>
<summary>💡 Summary</summary>

The scatter plot shows a **strong positive correlation** between experience and salary (r ≈ 0.88). The trend line slopes upward and most points lie close to it, but not all: Mobile developers (9.4 years, $44k) sit well below the line. This indicates that while more experience generally goes with higher salary, the developer type (Senior, Manager, SRE, etc.) also affects compensation. For example, Senior and Manager roles have the same median salary ($95k) despite different experience levels (15.3 vs 13.5 years).
</details>
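
The coefficient returned by `.corr()` is Pearson's r: the covariance of the two variables divided by the product of their spreads. The sketch below computes it from that definition in plain Python so the number can be checked independently; `pearson_r` is a hypothetical helper written for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: sum of co-deviations over the product of deviation norms."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)


experience = [3.5, 6.8, 15.3, 13.5, 11.2, 10.8, 8.6, 9.8, 10.2, 9.4, 7.2, 8.9]
salary = [42, 58, 95, 95, 85, 71, 65, 78, 75, 44, 52, 62]

r = pearson_r(experience, salary)
print(f"Pearson r = {r:.3f}")
```

For this dataset the hand computation gives r of about 0.875, which is the value the interpretation should be based on.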

### Test Scores vs Study Time

In [None]:
# Student test scores and study time
study_time = [1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9]
test_scores = [55, 58, 62, 65, 68, 70, 73, 76, 78, 80, 83, 85, 87, 90, 92, 95]

# Create scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(study_time, test_scores, s=150, alpha=0.6, c='blue', 
            edgecolors='black', linewidth=1.5, label='Student Scores')

# Add trend line
z = np.polyfit(study_time, test_scores, 1)
p = np.poly1d(z)
plt.plot(study_time, p(study_time), "r--", linewidth=2, alpha=0.8, 
         label=f'Trend Line: y={z[0]:.2f}x+{z[1]:.2f}')

# Calculate correlation
correlation = np.corrcoef(study_time, test_scores)[0, 1]

plt.xlabel('Study Time (hours per day)', fontsize=12, fontweight='bold')
plt.ylabel('Test Score', fontsize=12, fontweight='bold')
plt.title(f'Test Scores vs Study Time\nCorrelation: {correlation:.3f} (Strong Positive)', 
          fontsize=16, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Correlation coefficient: {correlation:.3f}")
print(f"\nDirection: Positive (as study time increases, test scores increase)")
print(f"Form: Linear")
print(f"Strength: Very Strong (correlation > 0.95)")
print(f"\nConclusion: There is a clear, strong positive relationship between study time and test scores.")

<details>
<summary>💡 Summary</summary>

This scatter plot demonstrates a **strong positive linear correlation** (r > 0.99) between study time and test scores. The points closely follow the trend line, forming a clear linear pattern. This indicates that as study time increases, test scores consistently increase. The relationship is very predictable: students who study more hours per day achieve higher test scores.
</details>
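
The trend line drawn by `np.polyfit(..., 1)` is an ordinary least-squares fit. The sketch below derives the same slope and intercept from the textbook formulas in plain Python and uses them for one prediction. The 5-hour query point is an arbitrary example chosen for illustration.

```python
study_time = [1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9]
test_scores = [55, 58, 62, 65, 68, 70, 73, 76, 78, 80, 83, 85, 87, 90, 92, 95]

n = len(study_time)
mean_x = sum(study_time) / n
mean_y = sum(test_scores) / n

# Least squares: slope = S_xy / S_xx, and the line passes through (mean_x, mean_y).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(study_time, test_scores))
s_xx = sum((x - mean_x) ** 2 for x in study_time)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Predicted score for a student who studies 5 hours per day.
predicted = slope * 5 + intercept

print(f"Trend line: y = {slope:.2f}x + {intercept:.2f}")
print(f"Predicted score at 5 hours: {predicted:.1f}")
```

The slope reads as "about five extra points per additional hour of daily study" within the observed range of 1 to 9 hours; the line says nothing reliable outside that range.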

### Coding Experience vs Survey Ease

In [None]:
# Years of coding vs survey difficulty rating (1-5, where 5 is hardest)
np.random.seed(42)
years_coding = np.random.randint(1, 25, 30)
# Negative correlation: more experience = lower difficulty rating (found the survey easier)
survey_difficulty = 5 - (years_coding / 5) + np.random.normal(0, 0.5, 30)
survey_difficulty = np.clip(survey_difficulty, 1, 5)  # Keep ratings between 1 and 5

# Create scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(years_coding, survey_difficulty, s=120, alpha=0.6, c='purple', 
            edgecolors='black', linewidth=1.5)

# Add trend line
z = np.polyfit(years_coding, survey_difficulty, 1)
p = np.poly1d(z)
plt.plot(years_coding, p(years_coding), "r--", linewidth=2, alpha=0.8,
         label=f'Trend Line: y={z[0]:.3f}x+{z[1]:.2f}')

# Calculate correlation
correlation = np.corrcoef(years_coding, survey_difficulty)[0, 1]

plt.xlabel('Years of Coding Experience', fontsize=12, fontweight='bold')
plt.ylabel('Survey Difficulty Rating (1=Easy, 5=Hard)', fontsize=12, fontweight='bold')
plt.title(f'Coding Experience vs Survey Difficulty\nCorrelation: {correlation:.3f}', 
          fontsize=16, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim(0, 6)
plt.tight_layout()
plt.show()

if correlation < -0.3:
    strength = "moderate"
    if abs(correlation) > 0.6:
        strength = "strong"
    print(f"Correlation coefficient: {correlation:.3f}")
    print(f"\nDirection: Negative")
    print(f"Strength: {strength.capitalize()}")
    print(f"\nConclusion: There is a {strength} negative correlation.")
    print("Developers with more coding experience found the survey easier to answer.")
else:
    print(f"Correlation coefficient: {correlation:.3f}")
    print(f"\nThe correlation is weak. Experience doesn't strongly predict survey ease.")

<details>
<summary>💡 Summary</summary>

This scatter plot shows a **negative correlation** between years of coding experience and survey difficulty rating. The downward-sloping trend line indicates that more experienced developers found the survey easier (a lower difficulty score means easier). Because the data are simulated with little noise around a linear trend, the correlation comes out strong but not perfect; in real survey data, other factors would also influence how easy respondents found the survey.
</details>

## 8. Types of Correlation

### Demonstrating Different Correlation Patterns

In [None]:
# Create different correlation examples
np.random.seed(42)
x = np.linspace(0, 10, 50)

# Strong positive linear
y_strong_pos = 2 * x + 3 + np.random.normal(0, 1, 50)

# Weak positive linear
y_weak_pos = 0.5 * x + 5 + np.random.normal(0, 3, 50)

# Strong negative linear
y_strong_neg = -2 * x + 20 + np.random.normal(0, 1, 50)

# No correlation
y_no_corr = np.random.normal(10, 3, 50)

# Non-linear (quadratic)
y_nonlinear = 0.3 * (x - 5)**2 + 2 + np.random.normal(0, 0.5, 50)

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Different Types of Correlation', fontsize=18, fontweight='bold', y=1.00)

# Strong Positive
axes[0, 0].scatter(x, y_strong_pos, alpha=0.6, s=80)
z = np.polyfit(x, y_strong_pos, 1)
axes[0, 0].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[0, 0].set_title(f'Strong Positive Linear\nr = {np.corrcoef(x, y_strong_pos)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Weak Positive
axes[0, 1].scatter(x, y_weak_pos, alpha=0.6, s=80, color='orange')
z = np.polyfit(x, y_weak_pos, 1)
axes[0, 1].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[0, 1].set_title(f'Weak Positive Linear\nr = {np.corrcoef(x, y_weak_pos)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Strong Negative
axes[0, 2].scatter(x, y_strong_neg, alpha=0.6, s=80, color='red')
z = np.polyfit(x, y_strong_neg, 1)
axes[0, 2].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[0, 2].set_title(f'Strong Negative Linear\nr = {np.corrcoef(x, y_strong_neg)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[0, 2].grid(True, alpha=0.3)

# No Correlation
axes[1, 0].scatter(x, y_no_corr, alpha=0.6, s=80, color='gray')
z = np.polyfit(x, y_no_corr, 1)
axes[1, 0].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[1, 0].set_title(f'No Correlation\nr = {np.corrcoef(x, y_no_corr)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Non-linear
axes[1, 1].scatter(x, y_nonlinear, alpha=0.6, s=80, color='purple')
z = np.polyfit(x, y_nonlinear, 2)
axes[1, 1].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[1, 1].set_title(f'Non-linear (Quadratic)\nr = {np.corrcoef(x, y_nonlinear)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

# Summary text
axes[1, 2].axis('off')
summary_text = """
Correlation Summary:

DIRECTION:
• Positive: Both variables increase
• Negative: One increases, other decreases
• None: No clear relationship

STRENGTH (|r| value):
• 0.8 - 1.0: Very Strong
• 0.6 - 0.8: Strong
• 0.4 - 0.6: Moderate
• 0.2 - 0.4: Weak
• 0.0 - 0.2: Very Weak/None

FORM:
• Linear: Straight line pattern
• Non-linear: Curved pattern
"""
axes[1, 2].text(0.1, 0.5, summary_text, fontsize=11, 
                family='monospace', verticalalignment='center')

plt.tight_layout()
plt.show()

<details>
<summary>💡 Summary</summary>

This visualization demonstrates the three key aspects of correlation:

1. **Direction**: Positive (upward), negative (downward), or none (flat)
2. **Strength**: How closely points follow the trend line (strong correlations have tightly clustered points)
3. **Form**: Linear (straight line) or non-linear (curved pattern)

The correlation coefficient (r) ranges from -1 to +1, where values closer to ±1 indicate stronger correlations.
</details>
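
The strength bands from the summary panel can be written as a small lookup, and a noise-free parabola shows why form matters as much as strength. In the sketch below, `classify_strength` and `pearson_r` are hypothetical helpers, and assigning each boundary value (0.2, 0.4, ...) to the stronger band is an assumption, since the bands in the panel overlap at their edges.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from its definition."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)


def classify_strength(r):
    """Map |r| onto the strength bands listed in the summary panel."""
    size = abs(r)
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "very weak/none"


# A perfect but non-linear relationship: y depends only on x, symmetric around x = 5.
x = list(range(11))
y = [(value - 5) ** 2 for value in x]

r = pearson_r(x, y)
print(f"r = {r:.2f} -> {classify_strength(r)}")
print(f"r = -0.70 -> {classify_strength(-0.70)}")
```

The parabola is classified as "very weak/none" even though y is completely determined by x, because r only measures how well a straight line describes the data. This is one reason to look at the scatter plot before trusting the coefficient.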

## 9. Summary: Key Takeaways

### Descriptive Statistics Concepts

In [None]:
# Create a summary table
summary_data = {
    'Concept': ['Mean', 'Median', 'Mode', 'Pie Chart', 'Bar Chart', 'Line Chart', 'Scatter Plot'],
    'Use Case': [
        'Average value (no outliers)',
        'Typical value (with outliers)',
        'Most frequent value',
        'Part-to-whole (≤4 categories)',
        'Compare categories',
        'Trends over time',
        'Correlation between variables'
    ],
    'Category': ['Statistics', 'Statistics', 'Statistics', 'Visualization', 
                 'Visualization', 'Visualization', 'Visualization']
}

df_summary = pd.DataFrame(summary_data)

print("="*80)
print("DESCRIPTIVE STATISTICS - KEY CONCEPTS SUMMARY")
print("="*80)
print("\n")
print(df_summary.to_string(index=False))
print("\n")
print("="*80)
print("IMPORTANT REMINDERS:")
print("="*80)
print("1. Always check for outliers before calculating the mean")
print("2. Use median when data contains outliers")
print("3. Use mode for nominal (categorical) data")
print("4. Choose the right chart type for your data and question")
print("5. Correlation does not imply causation")
print("6. Always visualize data to identify patterns and anomalies")
print("="*80)

<details>
<summary>💡 Summary</summary>

This notebook covered the fundamental concepts of descriptive statistics:

- **Descriptive Statistics**: Mean, median, and mode help summarize data with a single representative value
- **Data Cleaning**: Identify and handle outliers to ensure accurate analysis
- **Visualizations**: Different chart types serve different purposes:
  - Pie charts for part-to-whole relationships
  - Bar charts for comparing categories
  - Line charts for showing trends over time
  - Scatter plots for identifying correlations
- **Correlation Analysis**: Understanding direction, form, and strength of relationships between variables

These tools form the foundation for exploratory data analysis and statistical inference.
</details>

# Descriptive Statistics Tutorial

This notebook demonstrates key concepts from descriptive statistics including:
- Data storage and manipulation
- Descriptive statistics (mean, median, mode)
- Data visualizations (pie charts, bar charts, line charts, scatter plots)
- Trend analysis and correlation

## 1. Import Required Libraries

First, we'll import the necessary Python libraries for data analysis and visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

**Summary:** We imported pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for visualizations, and scipy for statistical functions. These are the core libraries for data analysis in Python.

## 2. Data Storage Examples

### Creating and Storing Data in Different Formats

In [None]:
# Creating sample data about tennis players in England
tennis_data = {
    'Year': [2016, 2017, 2018, 2019, 2020, 2021],
    'NumberOfPlayers': [889300, 865100, 840200, 754900, 739900, 635700]
}

# Convert to DataFrame
df_tennis = pd.DataFrame(tennis_data)

# Display the data
print("Tennis Players in England:")
print(df_tennis)

# Save to CSV file
df_tennis.to_csv('tennis_players_england.csv', index=False)
print("\nData saved to tennis_players_england.csv")

**Summary:** We created a dataset using a Python dictionary, converted it to a pandas DataFrame for easy manipulation, and saved it as a CSV file. This demonstrates the fundamental way to store and organize data in Python for analysis.

## 3. Descriptive Statistics: Mean, Median, and Mode

### Example: Website Load Times

In [None]:
# Website load times from 7 trials (in seconds)
load_times = [1.2, 1.5, 2.0, 1.2, 3.2, 1.9, 80]

# Calculate descriptive statistics
mean_time = np.mean(load_times)
median_time = np.median(load_times)
mode_result = stats.mode(load_times, keepdims=True)
mode_time = mode_result.mode[0]

print("Website Load Time Analysis:")
print(f"Data: {load_times}")
print(f"\nMean (Average): {mean_time:.2f} seconds")
print(f"Median: {median_time:.2f} seconds")
print(f"Mode: {mode_time:.2f} seconds")
print(f"\nMinimum: {min(load_times):.2f} seconds")
print(f"Maximum: {max(load_times):.2f} seconds")

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">

<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">ðŸ’¡ Summary - Click to expand</summary></details>

<div style="margin-top: 10px; padding: 10px; line-height: 1.6;"></div>
The <strong>mean (13 seconds)</strong> is heavily skewed by the outlier value of 80 seconds. The <strong>median (1.9 seconds)</strong> provides a better representation of the typical load time since it's less sensitive to outliers. The <strong>mode (1.2 seconds)</strong> shows the most frequently occurring value.

### Handling Outliers: Data Cleaning

In [None]:
# Remove the outlier (80 seconds)
load_times_cleaned = [1.2, 1.5, 2.0, 1.2, 3.2, 1.9]

# Recalculate statistics
mean_cleaned = np.mean(load_times_cleaned)
median_cleaned = np.median(load_times_cleaned)
mode_cleaned = stats.mode(load_times_cleaned, keepdims=True).mode[0]

print("After Removing Outlier:")
print(f"Data: {load_times_cleaned}")
print(f"\nMean (Average): {mean_cleaned:.2f} seconds")
print(f"Median: {median_cleaned:.2f} seconds")
print(f"Mode: {mode_cleaned:.2f} seconds")

# Visualize the difference
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.boxplot([load_times], labels=['With Outlier'])
ax1.set_ylabel('Load Time (seconds)')
ax1.set_title('Load Times - With Outlier')

ax2.boxplot([load_times_cleaned], labels=['Without Outlier'])
ax2.set_ylabel('Load Time (seconds)')
ax2.set_title('Load Times - After Cleaning')

plt.tight_layout()
plt.show()

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">

<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">ðŸ’¡ Summary - Click to expand</summary></details>

<div style="margin-top: 10px; padding: 10px; line-height: 1.6;"></div>
After removing the outlier, the mean dropped from <strong>13 seconds to 1.83 seconds</strong>, now better representing the typical load time. The box plots visually demonstrate how outliers can distort data analysis. <strong>Data cleaning is essential</strong> for accurate statistical analysis.

### Comparing Different Datasets

In [None]:
# Two groups of developers' coding experience (in years)
respondents_a = [5, 3, 7, 10, 5, 6, 8, 12, 5, 15, 20, 7, 9, 5, 11, 8, 6, 14, 25, 9]
respondents_b = [5, 3, 7, 200, 5, 6, 8, 12, 5, 15, 20, 7, 9, 5, 150, 8, 6, 14, 25, 9]

# Calculate statistics for both groups
print("Respondents A (No Outliers):")
print(f"Mean: {np.mean(respondents_a):.2f} years")
print(f"Median: {np.median(respondents_a):.2f} years")
print(f"Mode: {stats.mode(respondents_a, keepdims=True).mode[0]:.0f} years")

print("\nRespondents B (With Outliers):")
print(f"Mean: {np.mean(respondents_b):.2f} years")
print(f"Median: {np.median(respondents_b):.2f} years")
print(f"Mode: {stats.mode(respondents_b, keepdims=True).mode[0]:.0f} years")

# Create comparison visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.hist(respondents_a, bins=10, edgecolor='black', alpha=0.7)
ax1.axvline(np.mean(respondents_a), color='red', linestyle='--', label=f'Mean: {np.mean(respondents_a):.2f}')
ax1.axvline(np.median(respondents_a), color='green', linestyle='--', label=f'Median: {np.median(respondents_a):.2f}')
ax1.set_xlabel('Years of Experience')
ax1.set_ylabel('Frequency')
ax1.set_title('Respondents A - No Outliers')
ax1.legend()

ax2.hist(respondents_b, bins=10, edgecolor='black', alpha=0.7)
ax2.axvline(np.mean(respondents_b), color='red', linestyle='--', label=f'Mean: {np.mean(respondents_b):.2f}')
ax2.axvline(np.median(respondents_b), color='green', linestyle='--', label=f'Median: {np.median(respondents_b):.2f}')
ax2.set_xlabel('Years of Experience')
ax2.set_ylabel('Frequency')
ax2.set_title('Respondents B - With Outliers')
ax2.legend()

plt.tight_layout()
plt.show()

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
For <strong>Respondents A</strong> without outliers, the mean (9.50 years) is a reasonable summary and sits close to the median (8 years). For <strong>Respondents B</strong> with outliers (200 and 150 years of coding), the median (8 years) is more representative than the mean (25.95 years). The histograms show how outliers pull the mean away from the median.
</div>
</details>
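
The arithmetic behind this comparison can be checked directly. This is a minimal sketch using the standard library's `statistics` module (chosen here for brevity; the notebook itself uses NumPy), with the two lists copied from the cell above.

```python
import statistics

respondents_a = [5, 3, 7, 10, 5, 6, 8, 12, 5, 15, 20, 7, 9, 5, 11, 8, 6, 14, 25, 9]
respondents_b = [5, 3, 7, 200, 5, 6, 8, 12, 5, 15, 20, 7, 9, 5, 150, 8, 6, 14, 25, 9]

mean_a, mean_b = statistics.mean(respondents_a), statistics.mean(respondents_b)
median_a, median_b = statistics.median(respondents_a), statistics.median(respondents_b)

# Two extreme values move the mean a long way but leave the median untouched
mean_shift = mean_b - mean_a
median_shift = median_b - median_a

print(f"A: mean={mean_a:.2f}, median={median_a:.2f}")
print(f"B: mean={mean_b:.2f}, median={median_b:.2f}")
print(f"Shift caused by outliers: mean {mean_shift:+.2f}, median {median_shift:+.2f}")
```

Only two of the twenty values differ between the groups, which is why the median, a positional statistic, does not move.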

## 4. Data Visualization: Pie Charts

### Product Sales Revenue Distribution

In [None]:
# Product sales data
products = ['Product A', 'Product B', 'Product C', 'Product D']
sales_percentages = [60, 20, 12, 8]
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']

# Create pie chart
plt.figure(figsize=(10, 8))
plt.pie(sales_percentages, labels=products, autopct='%1.1f%%', 
        colors=colors, startangle=90, explode=(0.1, 0, 0, 0))
plt.title('Product Sales Revenue Distribution', fontsize=16, fontweight='bold')
plt.axis('equal')
plt.show()

print("Sales Analysis:")
for product, percentage in zip(products, sales_percentages):
    print(f"{product}: {percentage}%")

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
The pie chart clearly shows that <strong>Product A dominates sales with 60%</strong> of total revenue, representing more than half of the company's sales. Product A is exploded (separated) from the pie to emphasize its importance. Pie charts are most effective when showing <strong>part-to-whole relationships</strong> with a small number of categories.
</div>
</details>

### Operating System Preferences

In [None]:
# Operating system preferences data
os_data = {
    'Operating System': ['Windows', 'MacOS', 'Linux-based', 'BSD'],
    'Count': [5, 3, 10, 2]
}

df_os = pd.DataFrame(os_data)

# Create pie chart
plt.figure(figsize=(10, 8))
colors_os = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']
plt.pie(df_os['Count'], labels=df_os['Operating System'], autopct='%1.1f%%',
        colors=colors_os, startangle=140)
plt.title('Developer Operating System Preferences', fontsize=16, fontweight='bold')
plt.axis('equal')
plt.show()

print("\nOperating System Distribution:")
print(df_os.to_string(index=False))

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
<strong>Linux-based operating systems</strong> are the most popular among respondents with 50% (10 out of 20 responses), followed by Windows at 25%. MacOS and BSD have the smallest slices at 15% and 10% respectively. This pie chart effectively communicates the preference distribution across four operating systems.
</div>
</details>
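
The percentages that `autopct` prints on the slices are simply each count divided by the total. A small sketch of that arithmetic, using the same counts as the cell above (the dictionary name `os_counts` is introduced here for illustration):

```python
os_counts = {'Windows': 5, 'MacOS': 3, 'Linux-based': 10, 'BSD': 2}

total = sum(os_counts.values())
# Share of each category as a percentage of all responses
shares = {name: 100 * count / total for name, count in os_counts.items()}

for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12} {pct:5.1f}%")
```

Because the shares are parts of one whole, they should always sum to 100%, which is a quick sanity check before drawing a pie chart.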

## 5. Data Visualization: Bar Charts

### Programming Language Popularity

In [None]:
# Programming language popularity data
languages = ['JavaScript', 'Python', 'Java', 'C#', 'C++', 'PHP', 'TypeScript', 'C']
popularity = [18, 15, 12, 8, 7, 6, 5, 4]

# Create horizontal bar chart
plt.figure(figsize=(12, 8))
bars = plt.barh(languages, popularity, color='steelblue', edgecolor='black')

# Add value labels on bars
for i, (bar, value) in enumerate(zip(bars, popularity)):
    plt.text(value + 0.3, i, f'{value}', va='center', fontweight='bold')

plt.xlabel('Number of Respondents', fontsize=12, fontweight='bold')
plt.ylabel('Programming Language', fontsize=12, fontweight='bold')
plt.title('Programming Language Experience', fontsize=16, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nProgramming Language Rankings:")
for i, (lang, count) in enumerate(zip(languages, popularity), 1):
    print(f"{i}. {lang}: {count} respondents")

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
The horizontal bar chart makes it easy to compare programming languages. <strong>JavaScript is the most popular</strong> with 18 respondents, followed by Python (15) and Java (12). Bar charts excel at comparing values across categories: differences in bar length are easier to perceive than differences in pie-slice area. Horizontal bars work well when category names are long.
</div>
</details>

### Web Framework Comparison

In [None]:
# Web framework popularity (% of respondents, 2021 Stack Overflow survey)
frameworks = ['React.js', 'jQuery', 'Angular', 'Vue.js', 'Express', 'ASP.NET', 'Django']
percentages = [40.14, 34.43, 26.23, 18.97, 23.82, 18.1, 14.99]

# Create vertical bar chart
plt.figure(figsize=(12, 8))
bars = plt.bar(frameworks, percentages, color='coral', edgecolor='black', alpha=0.8)

# Customize the chart
plt.xlabel('Web Framework', fontsize=12, fontweight='bold')
plt.ylabel('Percentage (%)', fontsize=12, fontweight='bold')
plt.title('Most Popular Web Frameworks (2021 Stack Overflow Survey)', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)

# Add value labels on top of bars
for bar, value in zip(bars, percentages):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'{value}%', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
<strong>React.js leads</strong> as the most popular web framework at 40.14%, followed by jQuery at 34.43%. A vertical column chart suits this data because there are only seven categories and the bar heights can be compared against a common baseline. We can clearly see that Angular (26.23%) is more popular than Vue.js (18.97%), and the visual comparison is much easier than with a pie chart.
</div>
</details>
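
The lists in the cell above are not in strictly descending order (Express at 23.82% is listed after Vue.js at 18.97%), so the bars do not read as a ranking. One option, sketched here as a suggestion rather than as part of the original notebook, is to sort the pairs before plotting:

```python
frameworks = ['React.js', 'jQuery', 'Angular', 'Vue.js', 'Express', 'ASP.NET', 'Django']
percentages = [40.14, 34.43, 26.23, 18.97, 23.82, 18.1, 14.99]

# Pair each name with its value, then sort by value, largest first
ranked = sorted(zip(frameworks, percentages), key=lambda pair: pair[1], reverse=True)
sorted_frameworks, sorted_percentages = zip(*ranked)

for rank, (name, pct) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {pct}%")
```

Passing `sorted_frameworks` and `sorted_percentages` to `plt.bar` in place of the original lists would draw the bars from tallest to shortest.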

## 6. Identifying Trends in Data: Line Charts

### Mobile and Telephone Subscriptions Over Time

In [None]:
# UK telecommunications data (2000-2019)
years = list(range(2000, 2020))
mobile_subs = [74, 78, 80, 85, 90, 95, 98, 100, 102, 102, 102, 103, 103, 103, 102, 102, 101, 101, 100, 100]
fixed_telephone = [60, 59.5, 59, 58.5, 58, 57.5, 57, 56.5, 56, 55.5, 55, 54.5, 54, 53.5, 53, 52.5, 52, 51.5, 51, 50.5]
broadband = [0, 0.5, 1, 2, 4, 8, 15, 25, 30, 35, 37, 38, 38.5, 39, 39.2, 39.4, 39.6, 39.8, 40, 40.2]

# Create line chart
plt.figure(figsize=(14, 8))
plt.plot(years, mobile_subs, marker='o', linewidth=2, label='Mobile Cellular', color='#3498db')
plt.plot(years, fixed_telephone, marker='s', linewidth=2, label='Fixed Telephone', color='#e74c3c')
plt.plot(years, broadband, marker='^', linewidth=2, label='Fixed Broadband', color='#2ecc71')

plt.xlabel('Year', fontsize=12, fontweight='bold')
plt.ylabel('Subscriptions (per 100 people)', fontsize=12, fontweight='bold')
plt.title('UK Telecommunications Subscriptions (2000-2019)', fontsize=16, fontweight='bold')
plt.legend(fontsize=11, loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Trend Analysis:")
print(f"Mobile Cellular: Increased from {mobile_subs[0]} (2000) to {mobile_subs[7]} (2007), then stayed roughly flat ({min(mobile_subs[7:])}-{max(mobile_subs[7:])})")
print(f"Fixed Telephone: Steady decrease from {fixed_telephone[0]} (2000) to {fixed_telephone[-1]} (2019)")
print(f"Fixed Broadband: Rapid increase from {broadband[0]} (2000) to {broadband[-1]} (2019)")

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
Line charts effectively show trends over time. Mobile subscriptions show an <strong>upward trend</strong> (2000-2007) followed by a roughly <strong>constant trend</strong> (between 100 and 103 per 100 people). Fixed telephone shows a <strong>downward trend</strong> throughout the period. Fixed broadband displays a <strong>significant upward trend</strong>, especially between 2000 and 2010, after which it levels off near 40. The different colored lines with markers make it easy to track each subscription type.
</div>
</details>
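
The labels "upward", "downward" and "constant" can be made concrete by looking at the average change per year over a period. The sketch below is one possible way to do that; the function name `classify_trend` and the cut-off of 0.25 subscriptions per 100 people per year are assumptions chosen for illustration, not values from the source.

```python
mobile_subs = [74, 78, 80, 85, 90, 95, 98, 100, 102, 102, 102, 103, 103, 103, 102, 102, 101, 101, 100, 100]
fixed_telephone = [60, 59.5, 59, 58.5, 58, 57.5, 57, 56.5, 56, 55.5, 55, 54.5, 54, 53.5, 53, 52.5, 52, 51.5, 51, 50.5]
broadband = [0, 0.5, 1, 2, 4, 8, 15, 25, 30, 35, 37, 38, 38.5, 39, 39.2, 39.4, 39.6, 39.8, 40, 40.2]

def classify_trend(series, start_year, end_year, first_year=2000, flat_threshold=0.25):
    """Label a period by its average change per year (threshold is an assumption)."""
    i, j = start_year - first_year, end_year - first_year
    rate = (series[j] - series[i]) / (end_year - start_year)
    if rate > flat_threshold:
        label = "upward"
    elif rate < -flat_threshold:
        label = "downward"
    else:
        label = "constant"
    return label, rate

for name, series, start, end in [
    ("Mobile", mobile_subs, 2000, 2007),
    ("Mobile", mobile_subs, 2007, 2019),
    ("Fixed telephone", fixed_telephone, 2000, 2019),
    ("Broadband", broadband, 2000, 2010),
]:
    label, rate = classify_trend(series, start, end)
    print(f"{name} {start}-{end}: {label} ({rate:+.2f} per year)")
```

An endpoint-based rate ignores what happens in between, so it is best read alongside the chart rather than instead of it.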

### Programming Language Trends (2013-2021)

In [None]:
# Programming language usage trends
years_prog = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
javascript = [62, 63, 64, 65, 69, 71, 69, 68, 65]
java = [38, 39, 39, 40, 40, 40, 39, 38, 36]
python = [18, 20, 22, 25, 28, 32, 38, 42, 48]

# Create line chart
plt.figure(figsize=(14, 8))
plt.plot(years_prog, javascript, marker='o', linewidth=3, label='JavaScript', color='#f7df1e', markersize=8)
plt.plot(years_prog, java, marker='s', linewidth=3, label='Java', color='#007396', markersize=8)
plt.plot(years_prog, python, marker='^', linewidth=3, label='Python', color='#3776ab', markersize=8)

plt.xlabel('Year', fontsize=12, fontweight='bold')
plt.ylabel('Percentage of Developers (%)', fontsize=12, fontweight='bold')
plt.title('Programming Language Popularity Trends (2013-2021)', fontsize=16, fontweight='bold')
plt.legend(fontsize=12, loc='best')
plt.grid(True, alpha=0.3)
plt.xticks(years_prog)
plt.tight_layout()
plt.show()

# Calculate net change over the whole period (in percentage points)
js_change = javascript[-1] - javascript[0]
java_change = java[-1] - java[0]
python_change = python[-1] - python[0]

print("\nTrend Analysis (2013-2021):")
print(f"JavaScript: {js_change:+.0f} percentage points (peaked in 2018, then declining)")
print(f"Java: {java_change:+.0f} percentage points (constant 2013-2018, declining since)")
print(f"Python: {python_change:+.0f} percentage points (strong upward trend throughout)")

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
<strong>Python</strong> shows a consistent <strong>strong upward trend</strong> throughout 2013-2021, growing by 30 percentage points. JavaScript was relatively constant (2013-2016), increased significantly (2016-2018), then slowly decreased. Java remained roughly constant (2013-2018) and then shows a slow downward trend. If these trends continue, Python could surpass JavaScript within the next decade, although extrapolating from nine data points is speculative.
</div>
</details>
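
To put a rough number on that projection, the sketch below extends each language's average yearly change over 2013-2021 in a straight line and asks when the two lines would meet. This is a deliberately naive constant-rate extrapolation, an assumption made purely for illustration; real adoption curves rarely stay linear.

```python
years_prog = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
javascript = [62, 63, 64, 65, 69, 71, 69, 68, 65]
python = [18, 20, 22, 25, 28, 32, 38, 42, 48]

span = years_prog[-1] - years_prog[0]                 # 8 years
js_rate = (javascript[-1] - javascript[0]) / span     # percentage points per year
py_rate = (python[-1] - python[0]) / span

gap_2021 = javascript[-1] - python[-1]                # how far Python trails in 2021
years_to_cross = gap_2021 / (py_rate - js_rate)       # time for Python to close the gap
crossing_year = years_prog[-1] + years_to_cross

print(f"Python: {py_rate:+.3f} pts/year, JavaScript: {js_rate:+.3f} pts/year")
print(f"Gap in 2021: {gap_2021} points")
print(f"Projected crossover under constant rates: about {crossing_year:.1f}")
```

Under these assumptions the crossover falls a little after 2026, which is consistent with "within the next decade" but should not be read as a forecast.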

## 7. Identifying Patterns: Scatter Charts and Correlation

### Developer Salary vs Experience

In [None]:
# Developer types data (experience vs salary)
developer_data = {
    'Type': ['Junior', 'Mid', 'Senior', 'Manager', 'SRE', 'DevOps', 'Data', 
             'Data Engineering', 'Scientist', 'Mobile', 'Frontend', 'Backend'],
    'Experience': [3.5, 6.8, 15.3, 13.5, 11.2, 10.8, 8.6, 9.8, 10.2, 9.4, 7.2, 8.9],
    'Salary': [42, 58, 95, 95, 85, 71, 65, 78, 75, 44, 52, 62]
}

df_dev = pd.DataFrame(developer_data)

# Create scatter plot with trend line
plt.figure(figsize=(14, 9))
plt.scatter(df_dev['Experience'], df_dev['Salary'], s=200, alpha=0.6, 
            c=df_dev['Salary'], cmap='viridis', edgecolors='black', linewidth=1.5)

# Add labels for each point
for i, row in df_dev.iterrows():
    plt.annotate(row['Type'], (row['Experience'], row['Salary']), 
                xytext=(5, 5), textcoords='offset points', fontsize=9, fontweight='bold')

# Add trend line
z = np.polyfit(df_dev['Experience'], df_dev['Salary'], 1)
p = np.poly1d(z)
plt.plot(df_dev['Experience'], p(df_dev['Experience']), "r--", linewidth=2, 
         alpha=0.8, label=f'Trend Line: y={z[0]:.2f}x+{z[1]:.2f}')

# Calculate correlation
correlation = df_dev['Experience'].corr(df_dev['Salary'])

plt.xlabel('Average Years of Professional Experience', fontsize=12, fontweight='bold')
plt.ylabel('Median Yearly Salary (USD, thousands)', fontsize=12, fontweight='bold')
plt.title(f'Developer Salary vs Experience\nCorrelation: {correlation:.3f}', 
          fontsize=16, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.colorbar(label='Salary (thousands USD)')
plt.tight_layout()
plt.show()

print(f"Correlation coefficient: {correlation:.3f}")
print("\nInterpretation: The correlation is strong and positive.")
print("Experience tracks salary closely, though developer type also plays a role.")

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
The scatter plot shows a <strong>strong positive correlation</strong> between experience and salary (r ≈ 0.88 for this data). The trend line goes upward, but the points do not all sit on it. This indicates that while more experience generally goes with higher salary, the <strong>developer type</strong> (Senior, Manager, SRE, etc.) also affects compensation. For example, Senior and Manager roles have the same salary (~$95k) despite different experience levels (15.3 vs 13.5 years), and Mobile (9.4 years, ~$44k) earns far less than Data Engineering (9.8 years, ~$78k).
</div>
</details>
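
The coefficient that `Series.corr` reports is Pearson's r: the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. The sketch below computes it from that definition with the standard library only; the helper name `pearson_r` is introduced here for illustration.

```python
import math

experience = [3.5, 6.8, 15.3, 13.5, 11.2, 10.8, 8.6, 9.8, 10.2, 9.4, 7.2, 8.9]
salary = [42, 58, 95, 95, 85, 71, 65, 78, 75, 44, 52, 62]

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of co-deviations and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(experience, salary)
print(f"Pearson r = {r:.3f}")
```

Working through the definition once makes it clear why a single far-off point can move r noticeably: each point contributes in proportion to how far it lies from both means.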

### Test Scores vs Study Time

In [None]:
# Student test scores and study time
study_time = [1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9]
test_scores = [55, 58, 62, 65, 68, 70, 73, 76, 78, 80, 83, 85, 87, 90, 92, 95]

# Create scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(study_time, test_scores, s=150, alpha=0.6, c='blue', 
            edgecolors='black', linewidth=1.5, label='Student Scores')

# Add trend line
z = np.polyfit(study_time, test_scores, 1)
p = np.poly1d(z)
plt.plot(study_time, p(study_time), "r--", linewidth=2, alpha=0.8, 
         label=f'Trend Line: y={z[0]:.2f}x+{z[1]:.2f}')

# Calculate correlation
correlation = np.corrcoef(study_time, test_scores)[0, 1]

plt.xlabel('Study Time (hours per day)', fontsize=12, fontweight='bold')
plt.ylabel('Test Score', fontsize=12, fontweight='bold')
plt.title(f'Test Scores vs Study Time\nCorrelation: {correlation:.3f} (Strong Positive)', 
          fontsize=16, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Correlation coefficient: {correlation:.3f}")
print(f"\nDirection: Positive (as study time increases, test scores increase)")
print(f"Form: Linear")
print(f"Strength: Very Strong (correlation > 0.95)")
print(f"\nConclusion: There is a clear, strong positive relationship between study time and test scores.")

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
This scatter plot demonstrates a <strong>strong positive linear correlation</strong> (r > 0.99) between study time and test scores. The points closely follow the trend line, forming a clear linear pattern. This indicates that as study time increases, test scores consistently increase. The relationship is very predictable: in this dataset, students who study more hours per day achieve higher test scores.
</div>
</details>
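
When the points follow a line this closely, the fitted line can be used to estimate a score for a given study time. `np.polyfit(x, y, 1)` returns the least-squares slope and intercept; the sketch below computes the same two numbers from the closed-form formulas so the steps are visible. The helper name `fit_line` and the choice of 5 hours as the query point are illustrative assumptions.

```python
study_time = [1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9]
test_scores = [55, 58, 62, 65, 68, 70, 73, 76, 78, 80, 83, 85, 87, 90, 92, 95]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(study_time, test_scores)
predicted_5h = slope * 5 + intercept

print(f"Trend line: y = {slope:.2f}x + {intercept:.2f}")
print(f"Estimated score for 5 hours/day: {predicted_5h:.1f} (observed: 76)")
```

The estimate is only meaningful inside the observed range of 1 to 9 hours; the line says nothing reliable about, say, 15 hours a day.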

### Coding Experience vs Survey Difficulty

In [None]:
# Years of coding vs survey difficulty rating (1-5, where 5 is hardest)
np.random.seed(42)
years_coding = np.random.randint(1, 25, 30)
# Negative correlation: more experience = lower difficulty rating (found survey easier)
survey_difficulty = 5 - (years_coding / 5) + np.random.normal(0, 0.5, 30)
survey_difficulty = np.clip(survey_difficulty, 1, 5)  # Keep ratings between 1 and 5

# Create scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(years_coding, survey_difficulty, s=120, alpha=0.6, c='purple', 
            edgecolors='black', linewidth=1.5)

# Add trend line
z = np.polyfit(years_coding, survey_difficulty, 1)
p = np.poly1d(z)
plt.plot(years_coding, p(years_coding), "r--", linewidth=2, alpha=0.8,
         label=f'Trend Line: y={z[0]:.3f}x+{z[1]:.2f}')

# Calculate correlation
correlation = np.corrcoef(years_coding, survey_difficulty)[0, 1]

plt.xlabel('Years of Coding Experience', fontsize=12, fontweight='bold')
plt.ylabel('Survey Difficulty Rating (1=Easy, 5=Hard)', fontsize=12, fontweight='bold')
plt.title(f'Coding Experience vs Survey Difficulty\nCorrelation: {correlation:.3f}', 
          fontsize=16, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim(0, 6)
plt.tight_layout()
plt.show()

if correlation < -0.3:
    strength = "moderate"
    if abs(correlation) > 0.6:
        strength = "strong"
    print(f"Correlation coefficient: {correlation:.3f}")
    print(f"\nDirection: Negative")
    print(f"Strength: {strength.capitalize()}")
    print(f"\nConclusion: There is a {strength} negative correlation.")
    print("Developers with more coding experience found the survey easier to answer.")
else:
    print(f"Correlation coefficient: {correlation:.3f}")
    print(f"\nThe correlation is weak. Experience doesn't strongly predict survey ease.")

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
This scatter plot shows a <strong>negative correlation</strong> between years of coding experience and the survey difficulty rating. The downward-sloping trend line indicates that more experienced developers found the survey easier (a lower difficulty score means easier). The strength label comes from the computed coefficient (strong if |r| > 0.6, otherwise moderate); either way the relationship isn't perfectly predictable, because other factors (here, simulated random noise) also influence the rating.
</div>
</details>
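
The `if`/`else` in the cell above only handles negative coefficients. A small reusable helper, sketched here with the same cut-offs the cell uses (0.3 and 0.6), can label both the direction and the strength of any coefficient; the function name `describe_correlation` is hypothetical, and the cut-offs are conventions rather than fixed rules.

```python
def describe_correlation(r, weak_below=0.3, strong_above=0.6):
    """Return (direction, strength) labels for a correlation coefficient."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    magnitude = abs(r)
    if magnitude <= weak_below:
        strength = "weak"
    elif magnitude > strong_above:
        strength = "strong"
    else:
        strength = "moderate"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return direction, strength

for value in (-0.9, -0.45, 0.1, 0.75):
    direction, strength = describe_correlation(value)
    print(f"r = {value:+.2f}: {strength} {direction}")
```

Keeping the thresholds as parameters makes it easy to swap in a different banding scheme.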

## 8. Types of Correlation

### Demonstrating Different Correlation Patterns

In [None]:
# Create different correlation examples
np.random.seed(42)
x = np.linspace(0, 10, 50)

# Strong positive linear
y_strong_pos = 2 * x + 3 + np.random.normal(0, 1, 50)

# Weak positive linear
y_weak_pos = 0.5 * x + 5 + np.random.normal(0, 3, 50)

# Strong negative linear
y_strong_neg = -2 * x + 20 + np.random.normal(0, 1, 50)

# No correlation
y_no_corr = np.random.normal(10, 3, 50)

# Non-linear (quadratic)
y_nonlinear = 0.3 * (x - 5)**2 + 2 + np.random.normal(0, 0.5, 50)

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Different Types of Correlation', fontsize=18, fontweight='bold', y=1.00)

# Strong Positive
axes[0, 0].scatter(x, y_strong_pos, alpha=0.6, s=80)
z = np.polyfit(x, y_strong_pos, 1)
axes[0, 0].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[0, 0].set_title(f'Strong Positive Linear\nr = {np.corrcoef(x, y_strong_pos)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Weak Positive
axes[0, 1].scatter(x, y_weak_pos, alpha=0.6, s=80, color='orange')
z = np.polyfit(x, y_weak_pos, 1)
axes[0, 1].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[0, 1].set_title(f'Weak Positive Linear\nr = {np.corrcoef(x, y_weak_pos)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Strong Negative
axes[0, 2].scatter(x, y_strong_neg, alpha=0.6, s=80, color='red')
z = np.polyfit(x, y_strong_neg, 1)
axes[0, 2].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[0, 2].set_title(f'Strong Negative Linear\nr = {np.corrcoef(x, y_strong_neg)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[0, 2].grid(True, alpha=0.3)

# No Correlation
axes[1, 0].scatter(x, y_no_corr, alpha=0.6, s=80, color='gray')
z = np.polyfit(x, y_no_corr, 1)
axes[1, 0].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[1, 0].set_title(f'No Correlation\nr = {np.corrcoef(x, y_no_corr)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Non-linear
axes[1, 1].scatter(x, y_nonlinear, alpha=0.6, s=80, color='purple')
z = np.polyfit(x, y_nonlinear, 2)
axes[1, 1].plot(x, np.poly1d(z)(x), 'r--', linewidth=2)
axes[1, 1].set_title(f'Non-linear (Quadratic)\nr = {np.corrcoef(x, y_nonlinear)[0,1]:.2f}', 
                     fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

# Summary text
axes[1, 2].axis('off')
summary_text = """
Correlation Summary:

DIRECTION:
• Positive: Both variables increase
• Negative: One increases, other decreases
• None: No clear relationship

STRENGTH (|r| value):
• 0.8 - 1.0: Very Strong
• 0.6 - 0.8: Strong
• 0.4 - 0.6: Moderate
• 0.2 - 0.4: Weak
• 0.0 - 0.2: Very Weak/None

FORM:
• Linear: Straight line pattern
• Non-linear: Curved pattern
"""
axes[1, 2].text(0.1, 0.5, summary_text, fontsize=11, 
                family='monospace', verticalalignment='center')

plt.tight_layout()
plt.show()

<details style="background-color: #f0f7ff; padding: 15px; border-radius: 8px; border: 1px solid #d0e7ff; margin-top: 10px;">
<summary style="cursor: pointer; font-weight: bold; color: #0056b3; font-size: 1.1em;">💡 Summary - Click to expand</summary>
<div style="margin-top: 10px; padding: 10px; line-height: 1.6;">
This visualization demonstrates the three key aspects of correlation:
<ol style="margin-top: 10px; line-height: 1.8;">
<li><strong>Direction:</strong> Positive (upward), negative (downward), or none (flat)</li>
<li><strong>Strength:</strong> How closely points follow the trend line (strong correlations have tightly clustered points)</li>
<li><strong>Form:</strong> Linear (straight line) or non-linear (curved pattern)</li>
</ol>
The correlation coefficient (r) ranges from <strong>-1 to +1</strong>, where values closer to ±1 indicate stronger correlations.
</div>
</details>
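
One consequence of the "form" aspect is worth seeing in numbers: Pearson's r measures only linear association, so a perfectly predictable curved relationship can still give r close to 0. The sketch below uses noise-free versions of the linear and quadratic patterns from the cell above; the helper `pearson_r` is defined here for illustration and uses only the standard library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = list(range(11))                                # 0, 1, ..., 10
y_linear = [2 * v + 3 for v in x]                  # exact straight line
y_quadratic = [0.3 * (v - 5) ** 2 + 2 for v in x]  # exact parabola centred on x = 5

r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)

print(f"Linear:    r = {r_linear:.3f}")
print(f"Quadratic: r = {r_quadratic:.3f}")
```

The parabola is symmetric about the middle of the x range, so its falling and rising halves cancel out in the calculation. A near-zero r therefore does not rule out a relationship, which is one reason to plot the data as well.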

## 9. Summary: Key Takeaways

### Descriptive Statistics Concepts

In [None]:
# Create a summary table of concepts and when to use them
summary_data = {
    'Concept': ['Mean', 'Median', 'Mode', 'Pie Chart', 'Bar Chart', 'Line Chart', 'Scatter Plot'],
    'Use Case': [
        'Average value (no outliers)',
        'Typical value (with outliers)',
        'Most frequent value',
        'Part-to-whole (≤4 categories)',
        'Compare categories',
        'Trends over time',
        'Correlation between variables'
    ],
    'Category': ['Statistics', 'Statistics', 'Statistics', 'Visualization', 
                 'Visualization', 'Visualization', 'Visualization']
}

df_summary = pd.DataFrame(summary_data)

print("="*80)
print("DESCRIPTIVE STATISTICS - KEY CONCEPTS SUMMARY")
print("="*80)
print("\n")
print(df_summary.to_string(index=False))
print("\n")
print("="*80)
print("IMPORTANT REMINDERS:")
print("="*80)
print("1. Always check for outliers before calculating the mean")
print("2. Use median when data contains outliers")
print("3. Use mode for nominal (categorical) data")
print("4. Choose the right chart type for your data and question")
print("5. Correlation does not imply causation")
print("6. Always visualize data to identify patterns and anomalies")
print("="*80)

**Summary:** This notebook covered the fundamental concepts of descriptive statistics:

- **Descriptive Statistics**: Mean, median, and mode help summarize data with a single representative value
- **Data Cleaning**: Identify and handle outliers to ensure accurate analysis
- **Visualizations**: Different chart types serve different purposes:
  - Pie charts for part-to-whole relationships
  - Bar charts for comparing categories
  - Line charts for showing trends over time
  - Scatter plots for identifying correlations
- **Correlation Analysis**: Understanding direction, form, and strength of relationships between variables
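
As a concrete instance of the reminder that the mode suits nominal data, the sketch below finds the most frequent operating system from this notebook's survey counts. A mean or median of category names is undefined, but the mode still is. The expanded `responses` list is reconstructed from the counts for illustration.

```python
from collections import Counter

os_counts = {'Windows': 5, 'MacOS': 3, 'Linux-based': 10, 'BSD': 2}

# Rebuild one entry per respondent, as raw survey answers would look
responses = [name for name, count in os_counts.items() for _ in range(count)]

tally = Counter(responses)
mode_value, mode_count = tally.most_common(1)[0]

print(f"Responses: {len(responses)}")
print(f"Mode: {mode_value} ({mode_count} responses)")
```

If two categories tie for first place, `most_common(1)` returns only one of them, so ties are worth checking separately.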

These tools form the foundation for exploratory data analysis and statistical inference.