# Day 1: Introduction to Data Analysis with Python

Welcome to your first day of data analysis! In this notebook, we'll explore the fundamentals of working with data using Python's most powerful libraries for data science.

## Learning Objectives

By the end of this session, you will be able to:
- Import and use essential data science libraries
- Load datasets from CSV files
- Understand the difference between wide and long data formats
- Transform data between different formats using pandas

---



## Step 1: Importing Essential Libraries

Before we can work with data, we need to import the libraries that give us the tools we need.

To do so, use the following pattern:
```python
import library_name as alias
```

Think of these as specialized toolkits:

- **pandas** (`pd`): The powerhouse for data manipulation and analysis. It provides DataFrames, which are like supercharged spreadsheets in Python.
- **matplotlib.pyplot** (`plt`): A comprehensive library for creating static, animated, and interactive visualizations.
- **seaborn** (`sns`): Built on top of matplotlib, it provides a high-level interface for drawing attractive statistical graphics.

### Why Use Aliases?

We use short aliases (like `pd`, `plt`, `sns`) to make our code cleaner and follow community conventions. This is a standard practice that makes your code more readable and easier to share with other data scientists.

**Best Practice:** Always import these libraries at the very beginning of your notebook!

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Step 2: Loading Your First Dataset
Now that we have our tools ready, it's time to load some real data. We'll be working with the Palmer Penguins dataset, which we load from a CSV file:


```python
df = pd.read_csv('penguins.csv', sep=',', header=0)
```
The `read_csv()` function is one of the most commonly used pandas functions.

Let's break down the parameters:

- `'penguins.csv'`: The filename or path to your CSV file
- `sep=','`: The delimiter character (comma in this case, which is standard for CSV files)
- `header=0`: Tells pandas that the first row (index 0) contains column names

**Pro Tip:** Simply typing the DataFrame variable name (like `df`) at the end of a cell will display the data in a nice, formatted table. This is one of the convenient features of Jupyter notebooks!
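
As a minimal sketch of how these parameters behave, the example below reads a small, made-up CSV from an in-memory string rather than from `penguins.csv`. The `csv_text` content and the use of `io.StringIO` are assumptions for illustration only. It also shows `index_col=0`, an optional `read_csv` parameter that treats the first column as the row index, which can avoid an extra `Unnamed: 0` column when a file was saved together with its index.

```python
import io

import pandas as pd

# Made-up sample that mimics a CSV saved together with its index column.
csv_text = """,species,island,body_mass_g
0,Adelie,Torgersen,3750.0
1,Adelie,Torgersen,3800.0
2,Gentoo,Biscoe,
"""

# Default behaviour: the unnamed first column is loaded as "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)

# With index_col=0 the first column becomes the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), sep=",", header=0, index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Whether to use `index_col=0` depends on how the file was written, so it is worth checking the loaded columns before deciding.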

In [2]:
header_idx = 0  # First row holds the column names
df = pd.read_csv('penguins.csv', sep=',', header=header_idx)  # Load the CSV file
df

,Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,3,Adelie,Torgersen,,,,,
4,4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...,...
339,339,Gentoo,Biscoe,,,,,
340,340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


## Inspect data
Use `.columns` on your DataFrame to see the column names of the dataset.

Before diving into data analysis, it's crucial to understand what information your dataset contains. The `.columns` attribute is one of the first tools you should use when exploring a new dataset.

**What does `.columns` do?**
- Returns an Index object containing all column names in your DataFrame
- Shows you exactly what variables you have available for analysis
- Helps you understand the structure of your data

**Syntax:**
```python
df.columns
```

**Pro Tip:** You can convert this to a list for easier reading:
```python
list(df.columns)
```

In [4]:
list(df.columns)

['Unnamed: 0',
 'species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex']

One of the first questions you should ask when working with a dataset is: "How big is it?" The `.shape` attribute gives you a quick answer by returning the dimensions of your DataFrame.

**What does `.shape` do?**
- Returns a tuple with two values: (number of rows, number of columns)
- Provides an instant overview of your dataset's size
- Helps you understand the scale of data you're working with

**Syntax:**
```python
df.shape
```

**Why is this important?**

1. **Memory Management**: Know if you're working with a small dataset (hundreds of rows) or big data (millions of rows)
2. **Performance Planning**: Large datasets may require different approaches or more processing time
3. **Data Validation**: Verify that your data loaded correctly (e.g., if you expect 1000 rows but only see 100, something went wrong)
4. **Quick Context**: Understand the scope of your analysis before diving in

**Understanding the Output:**
- **First number (rows)**: The number of observations/records in your dataset
- **Second number (columns)**: The number of variables/features you have

**Pro Tip:** You can access individual dimensions:
```python
num_rows = df.shape[0]      # Get number of rows
num_cols = df.shape[1]      # Get number of columns
```
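
The data-validation idea above can be sketched as follows. The `toy` DataFrame and the `expected_rows` value are hypothetical stand-ins, not part of the penguins data.

```python
import pandas as pd

# Toy data, not the real penguins file.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "body_mass_g": [3750.0, 5000.0, 3700.0],
    }
)

n_rows, n_cols = toy.shape  # tuple unpacking: (rows, columns)
print(f"{n_rows} rows x {n_cols} columns")

# Simple sanity check after loading: fail early if the size is unexpected.
expected_rows = 3  # hypothetical expectation for this toy example
assert n_rows == expected_rows, f"expected {expected_rows} rows, got {n_rows}"
```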



In [5]:
df.shape  # (rows, columns)

(344, 8)

## Previewing Your Data with `.head()` and `.tail()`

### Using `df.head()` to View the First Rows

After understanding your dataset's size and column names, the next step is to actually look at the data. The `.head()` method gives you a quick peek at the beginning of your dataset.

**What does `.head()` do?**
- Displays the first few rows of your DataFrame
- By default, shows 5 rows
- Allows you to customize how many rows to display

**Syntax:**
```python
df.head()          # Shows first 5 rows (default)
df.head(10)        # Shows first 10 rows
df.head(3)         # Shows first 3 rows
```


**When to use different numbers:**
- Use `head(3)` for a quick glance
- Use `head(10)` or `head(20)` for a more thorough initial inspection
- Use `head(100)` when you need to see more patterns or variations


In [6]:
df.head()  # preview of the first 5 rows
# df.head(10)  # preview of the first 10 rows

,Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,3,Adelie,Torgersen,,,,,
4,4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


`.tail()` works the same way as `.head()` but shows the **last** rows of your dataset, which is useful for checking if data loaded completely.

Try using `.tail()` with different numbers to explore the penguin dataset!
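
To see both methods side by side, here is a small sketch on a made-up ten-row DataFrame (the `toy` name and its `value` column are assumptions for illustration):

```python
import pandas as pd

# Toy frame with ten rows so that the slices are easy to verify.
toy = pd.DataFrame({"value": range(10)})

first_three = toy.head(3)  # rows 0, 1, 2
last_two = toy.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```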

In [7]:
df.tail()

,Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
339,339,Gentoo,Biscoe,,,,,
340,340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
343,343,Gentoo,Biscoe,49.9,16.1,213.0,5400.0,Male


## Statistical Summary with `.describe()`

### Using `.describe()` to Get Quick Statistics

Once you've previewed your data, the next crucial step is understanding its statistical properties. The `.describe()` method is your go-to tool for getting a comprehensive statistical summary of your dataset.

**What does `.describe()` do?**
- Generates descriptive statistics for all numerical columns in your DataFrame
- Provides a quick overview of central tendency, dispersion, and distribution
- Automatically excludes non-numeric columns (like text or categories) by default

**Syntax:**
```python
df.describe()                    # Statistics for numeric columns only
df.describe(include='all')       # Statistics for all columns (numeric and categorical)
df.describe(include=['object'])  # Statistics for categorical columns only
```

**Key Statistics Provided:**

For **numeric columns**, you get:
- **count**: Number of non-null (non-missing) values
- **mean**: Average value
- **std**: Standard deviation (measure of spread)
- **min**: Minimum value
- **25%**: First quartile (25th percentile)
- **50%**: Median (middle value, 50th percentile)
- **75%**: Third quartile (75th percentile)
- **max**: Maximum value



**Pro Tip:** Use `.describe().T` to transpose the output, making it easier to read when you have many columns!

Run `.describe()` on the penguin dataset to get a statistical overview of the measurements!
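
Before running it on the full dataset, the variants above can be tried on a tiny example. This is only a sketch on made-up values. The `toy` DataFrame is an assumption, chosen so that the statistics are easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Adelie", "Gentoo"],
        "body_mass_g": [3750.0, 3800.0, None, 5200.0],
    }
)

numeric_summary = toy.describe()             # numeric columns only
full_summary = toy.describe(include="all")   # numeric and text columns
transposed = toy.describe().T                # one row per column

print(numeric_summary)
print(full_summary)
print(transposed)
```

Note that `count` is 3 for `body_mass_g` because the missing value is skipped, while the text column reports `unique`, `top`, and `freq` instead of a mean.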

In [8]:
df.describe()  # statistical summary

,Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,344.0,342.0,342.0,342.0,342.0
mean,171.5,43.92193,17.15117,200.915205,4201.754386
std,99.448479,5.459584,1.974793,14.061714,801.954536
min,0.0,32.1,13.1,172.0,2700.0
25%,85.75,39.225,15.6,190.0,3550.0
50%,171.5,44.45,17.3,197.0,4050.0
75%,257.25,48.5,18.7,213.0,4750.0
max,343.0,59.6,21.5,231.0,6300.0


## Dataset Overview with `.info()`

### Using `.info()` to Get a Complete Data Summary

While `.describe()` gives you statistics, `.info()` provides a structural overview of your DataFrame. This method is essential for understanding the technical details of your dataset.

**What does `.info()` do?**
- Displays the DataFrame's structure and metadata
- Shows column names, data types, and memory usage
- Reports the count of non-null (non-missing) values for each column
- Provides a quick diagnostic of data completeness

**Syntax:**
```python
df.info()
```

**Key Information Provided:**

1. **RangeIndex**: Total number of rows and the index range
2. **Column Names**: All column names in order
3. **Non-Null Count**: How many valid (non-missing) entries each column has
4. **Dtype**: Data type of each column (int64, float64, object, etc.)
5. **Memory Usage**: How much RAM your DataFrame is consuming

**Why is this important?**

1. **Data Type Verification**: Ensure columns have the correct type (dates as datetime, numbers as numeric, not text)
2. **Missing Data Detection**: Instantly identify which columns have missing values and how many
3. **Memory Management**: Understand memory consumption, especially important for large datasets
4. **Data Cleaning Planning**: Quickly spot which columns need attention before analysis
5. **Type Conversion Needs**: Identify when you need to convert data types (e.g., object to numeric)
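
The sketch below shows one way to work with this information programmatically, using a made-up two-column DataFrame (an assumption for illustration). `info()` prints its report rather than returning it, so the example passes a text buffer through the `buf` parameter to capture the output.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Gentoo"],
        "flipper_length_mm": [181.0, None, 215.0],
    }
)

# info() prints to stdout by default; a buffer lets us keep the text.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same facts are also available as regular pandas objects.
non_null_counts = toy.count()  # non-missing values per column
column_types = toy.dtypes      # dtype of each column
```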

In [9]:
df.info()  # structural overview

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB


## Data types

Type `df.body_mass_g` to select a single column. The result is a Series showing the column's values, with the data type (`dtype`) reported at the end of the output. Sometimes you will want to change this type.

In [10]:
# The dtype can also be set while loading: pd.read_csv('penguins.csv', dtype={'body_mass_g': float})
df.body_mass_g

,body_mass_g
0,3750.0
1,3800.0
2,3250.0
3,
4,3450.0
...,...
339,
340,4850.0
341,5750.0
342,5200.0


## Converting Data Types with `.astype()`

### Changing Column Data Types

Sometimes pandas doesn't automatically assign the correct data type to your columns, or you need to convert data for specific analyses. The `.astype()` method allows you to explicitly change a column's data type.

**What does `.astype()` do?**
- Converts a column from one data type to another
- Forces pandas to interpret the data in a specific way
- Can handle various type conversions (numeric, string, categorical, etc.)

**Syntax:**
```python
df.column_name.astype('Int64')
```

**Understanding 'Int64' vs 'int64':**

This is a crucial distinction that many beginners miss!

**`int64` (lowercase):**
- Standard integer type in NumPy/pandas
- **Cannot handle missing values (NaN)**
- Calling `.astype('int64')` on a column that contains NaN raises an error, which is why pandas stores such columns as `float64` when loading
- Fast and memory-efficient

**`Int64` (uppercase - note the capital I):**
- Nullable integer type provided by pandas (introduced in version 0.24)
- **Can handle missing values while remaining an integer type**
- Missing values are represented as `<NA>` instead of NaN
- Slightly more memory overhead but much more flexible

**Why is this important?**

1. **Preserve Data Integrity**: Keep integer columns as integers even with missing values
2. **Accurate Calculations**: Ensure mathematical operations work correctly
3. **Proper Aggregations**: Avoid issues when grouping or summarizing data
4. **Type Safety**: Prevent unexpected behavior in analysis

**What to Observe in the dtype Field:**

After running the conversion, look at the dtype field:
- **Before conversion**: Might show `float64` (because of NaN values) or `object` (if mixed types)
- **After conversion to Int64**: Shows `Int64`
- **Key observation**: The capital 'I' indicates this is a nullable integer type that can coexist with missing values
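
The difference between the two types can be demonstrated on a short, made-up Series (the values and the `flipper` name are assumptions for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Whole-number measurements stored as floats because one value is missing.
flipper = pd.Series([181.0, 186.0, np.nan, 193.0], name="flipper_length_mm")
print(flipper.dtype)  # float64

# NumPy's int64 has no way to represent a missing value, so this fails.
try:
    flipper.astype("int64")
    int64_worked = True
except ValueError as exc:
    int64_worked = False
    print(f"int64 conversion failed: {exc}")

# The nullable Int64 type keeps integers and marks the gap as <NA>.
flipper_nullable = flipper.astype("Int64")
print(flipper_nullable)
```

The conversion to `Int64` only succeeds here because every non-missing value is a whole number. A value such as 181.5 could not be cast safely.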

**Common Type Conversions:**
```python
df.column_name.astype('float64')        # Convert to decimal numbers
df.column_name.astype('str')            # Convert to text/string
df.column_name.astype('datetime64[ns]') # Convert to date/time (recent pandas versions require a unit such as [ns])
df.column_name.astype('bool')           # Convert to True/False
```

**Handling Conversion Errors:**
```python
df.column_name.astype('Int64', errors='ignore')  # Return the original column unchanged if conversion fails
df.column_name.astype('Int64', errors='raise')   # Raise an error if conversion fails (default)
```

**Pro Tip:** Always check your data types with `df.dtypes` or `df.info()` after loading data. Incorrect types are a common source of bugs in data analysis!

Try converting a numeric column to `'Int64'` and observe how the dtype changes from `float64` to `Int64`, allowing integers to coexist with missing values.

In [11]:
df.flipper_length_mm.astype('Int64')


,flipper_length_mm
0,181
1,186
2,195
3,
4,193
...,...
339,
340,215
341,222
342,212


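`pd.to_numeric()` also offers a `downcast` option, which converts the column to the smallest numeric type that can hold its values (here from `float64` to `float32`):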

In [12]:
print(df['flipper_length_mm'].dtype)
df['flipper_length_mm'] = pd.to_numeric(df['flipper_length_mm'], errors='coerce', downcast='float')
print(df['flipper_length_mm'].dtype)

float64
float32
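
The two `pd.to_numeric()` options used above can be seen in isolation in the following sketch. The `raw` Series with its invalid `"unknown"` entry is made up for illustration.

```python
import pandas as pd

# Made-up column where one entry is not a valid number.
raw = pd.Series(["181", "186", "unknown", "193"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_float = pd.to_numeric(raw, errors="coerce")

# downcast="float" additionally picks the smallest float type (at least float32).
as_small_float = pd.to_numeric(raw, errors="coerce", downcast="float")

print(as_float.dtype, as_small_float.dtype)
```

Downcasting to `float32` roughly halves the memory used by the column, at the cost of lower precision (about seven significant digits), so it is a trade-off rather than a default choice.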


## Handling NA values

## Detecting Missing Values with `.isnull().sum()`

### Using `.isnull().sum()` to Count Missing Data

Missing data is one of the most common challenges in data analysis. Before you can analyze or visualize your data, you need to know where the gaps are. The `.isnull().sum()` method provides a clear count of missing values in each column.

**What does `.isnull().sum()` do?**
- Identifies all null/missing values (NaN) in your DataFrame
- Counts how many missing values exist in each column
- Returns a Series showing the count for every column

**Syntax:**
```python
df.isnull().sum()          # Count of null values per column
df.isnull().sum().sum()    # Total null values in entire DataFrame
```

**How it works (step by step):**
1. **`.isnull()`**: Creates a DataFrame of True/False values (True where data is missing)
2. **`.sum()`**: Counts the True values (missing entries) for each column

**Why is this important?**

1. **Data Quality Assessment**: Understand the completeness of your dataset
2. **Analysis Planning**: Decide how to handle missing data before proceeding
3. **Pattern Recognition**: Some columns might have more missing values than others
4. **Decision Making**: Determine if you should drop rows, fill values, or keep them as-is
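5. **Avoid Errors**: pandas skips missing values in most calculations by default, but many other tools (for example, NumPy functions or machine learning libraries) return NaN or raise errors when data is missing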


**Complementary Functions:**
```python
df.isnull().any()                 # Boolean: Does each column have ANY missing values?
df.isnull().sum() / len(df) * 100 # Percentage of missing values per column
```
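
As a small worked example of these counts, the sketch below uses a made-up four-row DataFrame (an assumption for illustration) in which the expected numbers are easy to verify by eye. `isnull().mean()` is used as a shorthand for dividing the count by the number of rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
        "sex": ["Male", None, None, "Female"],
    }
)

missing_count = toy.isnull().sum()            # missing values per column
missing_percent = toy.isnull().mean() * 100   # share of missing values, in percent
total_missing = int(toy.isnull().sum().sum()) # missing values in the whole frame

print(missing_count)
print(missing_percent)
print(total_missing)
```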

In [13]:
df.isnull().sum()  # df.isna().sum() gives the same result (isna and isnull are aliases)

,0
Unnamed: 0,0
species,0
island,0
bill_length_mm,2
bill_depth_mm,2
flipper_length_mm,2
body_mass_g,2
sex,11




Printing the first 10 rows with `.head(10)` shows some of these missing values:

In [14]:
df.head(10)

,Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,3,Adelie,Torgersen,,,,,
4,4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
6,6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
7,7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
8,8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


## Handling Missing Data with `.dropna()`

### Removing Columns with Too Many Missing Values

After identifying missing values, the next step is deciding what to do with them. The `.dropna()` method gives you powerful options for removing missing data based on specific criteria.

**Understanding the Code:**

```python
df_drop = df.dropna(axis=1, thresh=340)
```

**Breaking Down the Parameters:**

**`axis=1`**:
- Applies the operation to **columns** (`axis=0`, the default, applies it to rows)

**`thresh=340`**:
- **Threshold**: Minimum number of non-null values required to keep a column. With 344 rows, a column is kept only if it has at most 4 missing values

**Add `print()` statements:**
- **Before**: Print `df.shape` to show the original DataFrame dimensions (rows, columns)
- **After**: Print `df_drop.shape` to show the dimensions after dropping columns with too many missing values


**Experiment with the threshold**

Consider your dataset size and analysis needs. Remember that `df.info()` can give you a great overview!

**If you have more time, try these as well**
```python
df.dropna(axis=1, how='all')      # Drop columns where ALL values are missing
df.dropna(axis=1, how='any')      # Drop columns with ANY missing values
df.dropna(axis=0, thresh=5)       # Drop rows with fewer than 5 non-null values
df.dropna(subset=['column_name']) # Drop rows with NaN in specific columns
```

**Important Note:** This creates a new DataFrame (`df_drop`) and doesn't modify the original (`df`). To modify in-place, add `inplace=True`.
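
To make the effect of `thresh` and `subset` concrete, here is a minimal sketch on a made-up DataFrame. The `toy` data and the threshold of 3 are assumptions chosen so the result is easy to check.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "body_mass_g": [3750.0, np.nan, 5750.0, 5200.0],
        "sex": ["Male", None, None, None],
    }
)

# Keep only columns with at least 3 non-null values: "sex" has 1, so it is dropped.
cols_kept = toy.dropna(axis=1, thresh=3)

# Keep only rows that have a value in "body_mass_g".
rows_kept = toy.dropna(subset=["body_mass_g"])

print(f"original: {toy.shape}")
print(f"after column drop: {cols_kept.shape}")
print(f"after row drop: {rows_kept.shape}")
```

Both calls return new DataFrames, so `toy` itself keeps its original shape.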


In [15]:
#df.dropna()
print(f'Initial shape {df.shape}')
df_drop = df.dropna(axis=1, thresh=340)  # Keep columns with at least 340 non-NA values
print(f'Shape after dropping NaN {df_drop.shape}')

Initial shape (344, 8)
Shape after dropping NaN (344, 7)


Let's look at `df['bill_length_mm']` again. Do you see some NaNs?

In [16]:
df['bill_length_mm']

,bill_length_mm
0,39.1
1,39.5
2,40.3
3,
4,36.7
...,...
339,
340,46.8
341,50.4
342,45.2


This time, instead of dropping these values, you will fill them with the median of the column. To do so, use `fillna()` with `df['bill_length_mm'].median()` as the replacement value for every NaN in that column.


## Filling Missing Values with `.fillna()`

### Replacing NaN Values Instead of Dropping Them

Sometimes dropping missing data isn't the best solution—you might lose too much valuable information. Instead, you can **fill** (or **impute**) missing values with reasonable substitutes. The `.fillna()` method gives you flexible options for replacing NaN values.

**What does `.fillna()` do?**
- Replaces all NaN (missing) values in a column or DataFrame
- Allows you to specify what value should replace the missing data
- Can modify the original DataFrame or create a new one

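**Syntax:**
```python
df['bill_length_mm'] = df['bill_length_mm'].fillna(df['bill_length_mm'].median())
```

**Breaking Down the Parameters:**

**`df['bill_length_mm'].median()`:**
- Calculates the median (middle value) of the column, ignoring NaN values
- Uses this calculated value to replace all missing entries
- The median is computed from the existing valid data

**Assigning the result back:**
- `fillna()` returns a new Series/DataFrame and leaves the original untouched, so the result is assigned back to the column
- Changes to `df` then persist for the rest of the notebook (unless you reload the data)

**`inplace=True`:**
- You will often see `df['bill_length_mm'].fillna(..., inplace=True)` used to skip the reassignment
- On a selected column this is unreliable: pandas 2.2 emits a FutureWarning for it, and with Copy-on-Write (the default from pandas 3.0) `df` is no longer updated
- `inplace=True` is still accepted when called on the whole DataFrame, e.g. `df.fillna(0, inplace=True)`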
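
The following sketch applies the same idea to a made-up DataFrame (the `toy` data and the `"Unknown"` placeholder are assumptions for illustration). It assigns the filled column back to the DataFrame and also shows that `fillna()` accepts a dict to give each column its own replacement value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
        "sex": ["Male", None, "Female", None],
    }
)

# Median of the valid values 36.7, 39.1, 40.3 is 39.1.
median_bill = toy["bill_length_mm"].median()

# Assign the filled column back to the DataFrame.
toy["bill_length_mm"] = toy["bill_length_mm"].fillna(median_bill)

# A dict fills several columns at once, each with its own value.
filled = toy.fillna({"sex": "Unknown"})

print(filled)
```

Because `filled` is a new DataFrame, the `sex` column of `toy` still contains its missing values afterwards.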

### Experiment with other values as well: you can use the mean instead of the median (`df['bill_length_mm'].mean()`) or a single value such as 0

In [17]:
df['bill_length_mm'] = df['bill_length_mm'].fillna(df['bill_length_mm'].mean())
df['bill_length_mm']

,bill_length_mm
0,39.10000
1,39.50000
2,40.30000
3,43.92193
4,36.70000
...,...
339,43.92193
340,46.80000
341,50.40000
342,45.20000
