<a href="https://colab.research.google.com/github/kthing1/Data110-32008--Sp25/blob/main/Week2_class_sp25_V0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Working with Dataset Files (CSV)

There are two main ways to load data into your notebook:

1. **Direct Upload**: Upload a CSV file directly to Google Colab
   - Good for small to medium files
   - You'll need to re-upload if you close and reopen Colab

2. **GitHub URL**: Use a direct link to the raw file on GitHub
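   - Works for any file GitHub can host (GitHub blocks individual files larger than 100 MB)
   - No need to upload files manually
   - The link keeps working as long as the repository, branch, and file path do not change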

For this tutorial, we'll be working with 'happiness_2017.csv'.

In [None]:

# Option 1: read the CSV directly from its raw GitHub URL
df = pd.read_csv("https://raw.githubusercontent.com/Reben80/Data110-32008--Sp25/refs/heads/main/dataset/happiness_2017.csv")

# Option 2: if you have already uploaded the CSV to the Google Colab directory (left side panel),
# read it from /content instead. Remember that uploaded files are removed when the session ends,
# so if you close Colab and come back another day you will need to upload the file again.
# df = pd.read_csv("/content/happiness_2017.csv")
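
If you want to see what `pd.read_csv` does without a file or a network connection, the sketch below reads a tiny CSV from an in-memory string. The country names and values are made up for illustration and are not taken from `happiness_2017.csv`; the name `sample_df` is also just a placeholder.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file (hypothetical values)
csv_text = """Country,Rank,HappinessScore
A,1,7.5
B,2,7.4
C,3,7.3
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

# Quick sanity check after loading: (number of rows, number of columns)
print(sample_df.shape)
print(sample_df.head())
```

Checking `shape` right after loading is a quick way to confirm that the file was read with the number of rows and columns you expect.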





# Exploring Your Dataset

When working with a new dataset, there are several essential steps to understand your data:

1. **View Sample Data**: Use `df.head()` to see the first few rows
   - `df.head()` shows first 5 rows
   - `df.head(10)` shows first 10 rows
   - `df.tail()` shows last 5 rows

This gives you a quick preview of what your data looks like.

In [None]:
df.head(10)

In [None]:
df.tail()

# Understanding Your Data Structure

`df.info()` is a powerful command that tells you:
- How many rows and columns you have
- The name of each column
- The data type of each column (such as `int64`, `float64`, or `object`, which pandas typically uses for text)
- How many non-null values each column has
- How much memory your data is using

This is crucial for identifying missing data and understanding your dataset's structure.

In [None]:
df.info()
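
Since `df.info()` reports non-null counts, a natural follow-up is to count the missing values per column directly. The sketch below does this on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not the happiness data).

```python
import numpy as np
import pandas as pd

# Made-up data with one missing happiness score (hypothetical values)
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'HappinessScore': [7.5, np.nan, 6.1, 5.2],
    'Rank': [1, 2, 3, 4],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dtypes shows the data type of each column, as info() does
print(toy.dtypes)
```

The same two lines, `df.isna().sum()` and `df.dtypes`, should work on the tutorial's `df` as well.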

# Statistical Summary

For numerical data, `df.describe()` provides key statistics:
- count: number of non-null values
- mean: average value
- std: standard deviation
- min: minimum value
- 25%, 50%, 75%: quartile values (the 50% value is the median)
- max: maximum value

This helps you understand the distribution of your numerical data.

In [None]:
df.describe()
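
The result of `describe()` is itself a DataFrame, so you can pull out individual statistics by row label. Here is a minimal sketch on made-up scores (the values and the name `scores` are assumptions chosen so the statistics are easy to check by hand).

```python
import pandas as pd

# Five made-up scores (hypothetical values)
scores = pd.DataFrame({'HappinessScore': [4.0, 5.0, 6.0, 7.0, 8.0]})

summary = scores.describe()
print(summary)

# Rows of the summary are labeled 'count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'
mean_score = summary.loc['mean', 'HappinessScore']
median_score = summary.loc['50%', 'HappinessScore']
print("mean:", mean_score, "median:", median_score)
```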

# Working with Column Names

A helpful tip: instead of typing column names from memory (which can lead to errors), you can:
1. Use `print(df.columns)` to see all column names
2. Copy and paste the exact column names you need

This prevents typos, which would otherwise make your code fail with a `KeyError`.

In [None]:
print(df.columns)
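
Column names sometimes contain stray spaces that are hard to spot. The sketch below, using a made-up DataFrame (`toy` and its column names are assumptions for illustration), prints the names as a plain list so the quotes reveal any extra spaces, then cleans them up and checks that a column exists before using it.

```python
import pandas as pd

# Made-up DataFrame whose column names have stray spaces (hypothetical example)
toy = pd.DataFrame({' Rank': [1, 2], 'HappinessScore ': [7.5, 7.4]})

# tolist() prints each name in quotes, so leading/trailing spaces become visible
print(toy.columns.tolist())

# Remove leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()
print(toy.columns.tolist())

# Check that a column exists before using it
wanted = 'HappinessScore'
if wanted in toy.columns:
    print(toy[wanted])
else:
    print(f"Column {wanted!r} not found")
```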

# Creating Scatter Plots

Now we'll learn how to create scatter plots using matplotlib. A scatter plot is perfect for showing relationships between two variables.

Basic syntax: `plt.scatter(x_data, y_data)`

We'll improve our plots step by step:
1. Start with a basic plot
2. Add proper sizing
3. Include labels and titles
4. Customize the appearance

In [None]:
plt.scatter(df['Rank'],df['HappinessScore'])

# Customizing Plot Size

The figure size determines how large your plot will appear:
- `plt.figure(figsize=(width, height))`
- Width and height are in inches
- Common sizes: (10,6), (16,10)
- Larger figures are better for presentations
- Smaller figures work well for documents

In [None]:
plt.figure(figsize=(16,10))
plt.scatter(df['Rank'],df['HappinessScore'])
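
To connect "inches" to what you see on screen: the size in pixels is the size in inches multiplied by the figure's DPI (dots per inch). The sketch below reads both values back from a figure. The DPI depends on your Matplotlib configuration, so the pixel numbers may differ on your machine, and a notebook may rescale the image when it displays it.

```python
import matplotlib.pyplot as plt

# Create a figure that is 10 inches wide and 6 inches tall
fig = plt.figure(figsize=(10, 6))

# Read the size back in inches
width_in, height_in = fig.get_size_inches()

# Pixels = inches * dots per inch
width_px = width_in * fig.dpi
height_px = height_in * fig.dpi

print(f"{width_in} x {height_in} inches at {fig.dpi} DPI")
print(f"{width_px} x {height_px} pixels")

# Close the empty figure so it is not drawn
plt.close(fig)
```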

# Essential Plot Components

Every professional plot should include:
1. **X-axis label**: `plt.xlabel('Label Name')`
2. **Y-axis label**: `plt.ylabel('Label Name')`
3. **Title**: `plt.title('Your Title')`
4. **plt.show()**: Always end with this to display the plot cleanly

These elements help others understand your visualization immediately.

In [None]:
plt.figure(figsize=(16,10))
plt.scatter(df['Rank'],df['HappinessScore'])
plt.xlabel('Rank')
plt.ylabel('Happiness Score')
plt.title("Rank vs Happiness Score")
plt.show()

# More information about scatter plots: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

# Advanced Scatter Plot Customization

Let's create a professional-looking scatter plot with all the important customization options:

**Key Parameters:**
- `color`: Choose point color ('blue', 'red', etc.)
- `marker`: Point shape ('o' for circle, 's' for square, '^' for triangle)
- `edgecolors`: Outline color of points
- `alpha`: Transparency (0 to 1)
- `s`: Point size

**Styling Elements:**
- `fontsize`: Control text size
- `fontweight`: Make text bold
- `grid`: Add background grid
- `xticks/yticks`: Customize axis numbers

Below is a complete example using these parameters:

In [None]:
# Explanation of Scatter Plot Parameters


# Set the figure size
plt.figure(figsize=(16,10))  # 16 inches wide and 10 inches tall for better readability

# Scatter plot with styling
plt.scatter(df['Rank'], df['HappinessScore'],
            color='blue',         # Sets marker color to blue
            marker='o',           # Uses circular markers
            edgecolors='black',    # Adds a black outline to markers
            alpha=0.75,           # Makes points slightly transparent (75% opacity)
            s=50)                 # Sets marker size (marker area, in points squared)

# X and Y axis labels with styling
plt.xlabel('Rank', fontsize=14, fontweight='bold')  # Bold and larger font for readability
plt.ylabel('Happiness Score', fontsize=14, fontweight='bold')

# Title of the plot
plt.title("Rank vs Happiness Score", fontsize=18, fontweight='bold')  # Larger and bold title

# Grid for better readability
plt.grid(True, linestyle='--', alpha=0.5)  # Dashed grid lines with slight transparency

# Adjust tick label size
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Invert the x-axis so the best (lowest-numbered) ranks appear on the right
plt.gca().invert_xaxis()

# Show the plot
plt.show()


# Styling Your Plots

Matplotlib offers pre-built styles to make your plots look professional:
1. Check available styles with `plt.style.available`
2. Apply a style with `plt.style.use('style_name')`
3. Popular styles include:
   - 'ggplot': Clean, professional look (mimics R's ggplot2 library)
   - 'seaborn-v0_8': Modern, attractive defaults (this style was named 'seaborn' before Matplotlib 3.6)
   - 'classic': Traditional matplotlib style

Try different styles to find what works best for your presentation!

In [None]:
plt.style.available

In [None]:
plt.style.use('ggplot')

In [None]:
plt.figure(figsize=(16,10))
plt.scatter(df['Rank'],df['HappinessScore'])
plt.xlabel('Rank')
plt.ylabel('Happiness Score')
plt.title("Rank vs Happiness Score")
plt.show()
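
Note that `plt.style.use(...)` changes the style for every plot that follows. If you only want a style for one plot, one option is `plt.style.context`, which applies the style inside a `with` block and restores the previous settings afterwards. The sketch below uses made-up rank and score values rather than the happiness dataset.

```python
import matplotlib.pyplot as plt

# Made-up data standing in for the happiness columns (hypothetical values)
rank = [1, 2, 3, 4, 5]
score = [7.5, 7.4, 7.3, 7.2, 7.1]

# Remember the current axes background color so we can compare later
facecolor_before = plt.rcParams['axes.facecolor']

# The style applies only inside the "with" block
with plt.style.context('ggplot'):
    facecolor_inside = plt.rcParams['axes.facecolor']
    plt.figure(figsize=(8, 5))
    plt.scatter(rank, score)
    plt.xlabel('Rank')
    plt.ylabel('Happiness Score')
    plt.title('Rank vs Happiness Score (ggplot style)')
    plt.show()

# Outside the block, the previous settings are back
facecolor_after = plt.rcParams['axes.facecolor']
print(facecolor_before, facecolor_inside, facecolor_after)
```

To undo an earlier `plt.style.use(...)` call for the rest of the notebook, `plt.style.use('default')` switches back to Matplotlib's default style.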

A good place to learn more about styling your scatter plots is the official [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) website. Check it out and try experimenting with some of the settings.

# Practice Assignments 📊

Let's practice creating scatter plots using different variables from our happiness dataset!

### Assignment 1: Basic Scatter Plot
Create a scatter plot showing the relationship between 'Log GDP per capita' and 'HappinessScore'.
- Use appropriate axis labels
- Add a title
- Set figure size to (12,8)

### Assignment 2: Styled Scatter Plot
Create a scatter plot comparing 'Social support' vs 'Healthy life expectancy at birth'.
- Use red markers with black edges
- Set marker size to 100
- Add a grid
- Make markers semi-transparent (alpha=0.6)

### Assignment 3: Advanced Visualization
Create a scatter plot showing 'Freedom to make life choices' vs 'Positive affect'.
- Use triangle markers ('^')
- Make the markers blue with yellow edges
- Add bold labels
- Include a grid with dashed lines

### Bonus Challenge 🌟
Create a scatter plot comparing any two variables of your choice, but:
- Use a style from `plt.style.available`
- Add custom font sizes for labels
- Include a brief interpretation of what the plot shows



In [None]:
# Your code should be here

--------------------------------

### Anscombe's Quartet Dataset

This collection of four small datasets is known as **Anscombe's Quartet**, created by statistician Francis Anscombe to illustrate the importance of visualizing data. Despite having nearly identical statistical properties (e.g., mean, variance, correlation, and linear regression), each dataset tells a very different story when graphed.

- **x**: The independent variable, shared by the first three datasets.
- **y1, y2, y3**: Three different dependent variables associated with the same `x` values.
- **x4, y4**: A special case where most of the `x` values are identical, with one outlier.

#### Anscombe's Quartet:

In [None]:
# Anscombe's Quartet:
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
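
To check the claim that the four datasets have nearly identical statistics, the sketch below computes the mean, sample variance, and correlation for each one with NumPy. The dictionary names `quartet` and `stats` are just illustrative choices; the data values are the ones listed above.

```python
import numpy as np

# Anscombe's Quartet (same values as in the cell above)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Pair each y with its x values
quartet = {'y1': (x, y1), 'y2': (x, y2), 'y3': (x, y3), 'y4': (x4, y4)}

stats = {}
for name, (xs, ys) in quartet.items():
    xs = np.array(xs, dtype=float)
    ys = np.array(ys, dtype=float)
    stats[name] = {
        'mean_x': xs.mean(),
        'mean_y': ys.mean(),
        'var_x': xs.var(ddof=1),   # ddof=1 gives the sample variance
        'var_y': ys.var(ddof=1),
        'corr': np.corrcoef(xs, ys)[0, 1],
    }
    s = stats[name]
    print(f"{name}: mean_y={s['mean_y']:.2f}, var_y={s['var_y']:.2f}, corr={s['corr']:.3f}")
```

The printed values agree to about two decimal places across all four datasets, which is why plotting them is the only way to see how different they are.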

In [None]:
plt.scatter(x, y1)

Let's also fit a linear regression to this dataset. Don't worry about the code for now; just focus on the output. We will come back to this code later.

In [None]:
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x, y1, 1)

# Create the regression line
regression_line = m * np.array(x) + b

# Plot the data points and regression line
plt.scatter(x, y1)
plt.plot(x, regression_line, color='blue')
plt.xlabel('x')
plt.ylabel('y1')
plt.show()
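
For readers curious about what `np.polyfit(x, y1, 1)` computes: with degree 1 it finds the least-squares straight line. As a sketch, the slope and intercept can also be computed by hand from the standard least-squares formulas and compared with the `np.polyfit` result. The variable names `m_formula` and `b_formula` are just illustrative.

```python
import numpy as np

# First dataset of Anscombe's Quartet (same values as above)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Slope and intercept from np.polyfit (degree 1 = straight line)
m_fit, b_fit = np.polyfit(x, y1, 1)

# The same line from the least-squares formulas:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = x.mean(), y1.mean()
m_formula = np.sum((x - x_mean) * (y1 - y_mean)) / np.sum((x - x_mean) ** 2)
b_formula = y_mean - m_formula * x_mean

print(f"polyfit: slope={m_fit:.4f}, intercept={b_fit:.4f}")
print(f"formula: slope={m_formula:.4f}, intercept={b_formula:.4f}")
```

Both approaches give a slope close to 0.5 and an intercept close to 3.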


### Assignment 4: Anscombe's Quartet and Linear Regression

Perform the same linear regression process for the following datasets: y2, y3, and y4. Modify the code to calculate and plot the regression lines for each of these datasets. Use distinct colors for each plot and appropriately label the axes (y2, y3, etc.). Discuss any differences you observe when comparing the results across all datasets.

In [None]:
# Your code should be here