# Lab 3: Reading and Plotting Data with pandas and matplotlib

In this lab, you will learn how to:
- Load CSV and Excel data using `pandas`
- Inspect and summarize datasets
- Plot simple graphs using `matplotlib`

These tools are essential for analyzing agricultural data such as temperature, humidity, and CO₂ levels from sensors.

## What is pandas?
**pandas** is a Python library used for working with tabular data (like spreadsheets or CSV files).

With pandas, you can:
- Load and save data
- Explore and clean datasets
- Analyze and filter rows and columns
- Perform group operations and aggregations
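
As a rough illustration of the capabilities listed above, the sketch below builds a tiny table in memory, filters its rows, and computes a grouped average. The `readings` table and its values are made up for this example and are not part of the lab dataset.

```python
import pandas as pd

# Hypothetical sensor readings, invented for illustration only
readings = pd.DataFrame({
    "sensor": ["A", "A", "B", "B"],
    "temperature": [21.5, 23.0, 19.0, 20.0],
    "humidity": [60.0, 58.0, 70.0, 72.0],
})

# Filter rows: keep only readings warmer than 20 degrees
warm = readings[readings["temperature"] > 20]

# Group by sensor and compute the mean of each numeric column
sensor_means = readings.groupby("sensor").mean()

print(warm)
print(sensor_means)
```

Both results are ordinary DataFrames, so they can be filtered, grouped, or plotted again in the same way.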

## What is matplotlib?
**matplotlib** is a Python library for creating static, animated, and interactive visualizations.

It works well with pandas and is widely used in science and engineering to visualize trends, comparisons, and patterns.

## 1. Loading Data with pandas

In [None]:
# Run this cell if pandas is not already installed
!pip install pandas



## Reading CSV and Excel Files from Local Drive or Google Drive
In Google Colab, you can read CSV or Excel files uploaded from your local machine or stored in Google Drive.

### Option 1: Upload from Your Local Computer

First, download the Excel file below to your local machine (e.g., your laptop).  
https://github.com/nomurako/Lab_notebook/blob/master/example_strawberry_photosynthesis.xlsx  

Keep the file name "example_strawberry_photosynthesis.xlsx" unchanged, because the code below opens the file by that name.

In [9]:
# Open a file upload dialog in Colab and choose the downloaded file
from google.colab import files
uploaded = files.upload()

# After uploading, read the Excel file into a DataFrame
import pandas as pd
df = pd.read_excel("example_strawberry_photosynthesis.xlsx")
df.head()

Saving example_strawberry_photosynthesis.xlsx to example_strawberry_photosynthesis (4).xlsx


| Unnamed: 0 | obs | time | elapsed | date | hhmmss | averaging | TIME | E | Emm | A | ... | Ts | Tr | CO2_% | Desiccant_% | Humidifier_% | Txchg_sp | CO2_r_sp | H2O_r_sp | SS_s | SS_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 | 1681103000.0 | 15098.4 | 20230410 14:05:23 | 14:05:23 | 15 | 1681103000.0 | 0.001347 | 1.347312 | 12.50235 | ... | 30.4002 | 30.3621 | 24.8436 | 41.4688 | 0 | 21.1353 | 400 | 18 | 103.414 | 101.319 |
| 1 | 18 | 1681104000.0 | 16019.5 | 20230410 14:20:44 | 14:20:44 | 15 | 1681104000.0 | 0.001366 | 1.366296 | 11.986033 | ... | 30.0397 | 29.9935 | 24.8671 | 40.0914 | 0 | 21.8435 | 400 | 18 | 103.427 | 101.345 |
| 2 | 19 | 1681105000.0 | 16941.5 | 20230410 14:36:06 | 14:36:06 | 15 | 1681105000.0 | 0.001476 | 1.475873 | 10.908806 | ... | 28.9674 | 28.9444 | 24.8798 | 38.7121 | 0 | 23.4226 | 400 | 18 | 103.605 | 101.588 |
| 3 | 20 | 1681106000.0 | 17865.4 | 20230410 14:51:30 | 14:51:30 | 15 | 1681106000.0 | 0.001159 | 1.158864 | 7.768794 | ... | 28.9965 | 28.9598 | 24.8756 | 37.3201 | 0 | 23.238 | 400 | 18 | 103.535 | 101.5 |
| 4 | 21 | 1681107000.0 | 18783.4 | 20230410 15:06:48 | 15:06:48 | 15 | 1681107000.0 | 0.000654 | 0.653734 | 2.714961 | ... | 28.383 | 28.3604 | 24.9124 | 35.4063 | 0 | 23.8365 | 400 | 18 | 103.655 | 101.674 |


### Option 2: Access from Google Drive
Mount your Google Drive and navigate to the file location.

In [11]:
# Uncomment the code lines below to use this option.

# Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# Example path to a CSV file in your Drive (adjust the path accordingly)
# file_path = '/content/drive/MyDrive/your_folder/your_file.csv'
# df = pd.read_csv(file_path)
# df.head()
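
`pd.read_csv` behaves the same way whether the file comes from an upload or from Drive. The sketch below shows the call on a small piece of CSV text held in memory, so it runs without any file; the column names and values in `csv_text` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk or in Drive
csv_text = """time,temperature,humidity
2023-04-10 14:00,30.4,41.5
2023-04-10 15:00,28.4,35.4
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())
print(sample.shape)  # (rows, columns)
```

With a real file, the `io.StringIO(csv_text)` argument would simply be replaced by the file path.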

## What is a DataFrame and a Series?
When pandas reads a CSV or Excel file, it returns the data as a **DataFrame**.

### 1. DataFrame
- A **DataFrame** is a 2D table with rows and columns
- Think of it as a spreadsheet in memory
- Each column can have a different data type

### 2. Series
- A **Series** is a single column from a DataFrame
- It is 1-dimensional and includes the row index

### Example:
```python
df['Temperature']          # This returns a Series
df[['Temperature']]        # This returns a DataFrame
```
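
To see the difference for yourself, the following sketch checks the type and shape of each result. It uses a small made-up table named `demo` rather than the lab dataset.

```python
import pandas as pd

# Small illustrative table (values are invented)
demo = pd.DataFrame({"Temperature": [21.5, 23.0], "Humidity": [60.0, 58.0]})

single = demo["Temperature"]    # single brackets give a Series
table = demo[["Temperature"]]   # double brackets give a one-column DataFrame

print(type(single).__name__, single.shape)
print(type(table).__name__, table.shape)
```

The Series has a one-dimensional shape, while the one-column DataFrame keeps two dimensions (rows and columns).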

## 2. Exploring the Data

In [None]:
# Basic information and summary statistics
df.info()
df.describe()

## 4. Modifying Columns in pandas

In [None]:
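# Rename columns. This assumes `df` is a four-column table: a month column plus passenger
# counts for 1958, 1959, and 1960. On a DataFrame with a different number of columns
# (such as the strawberry file loaded above), this assignment raises a ValueError.
df.columns = ['Month', '1958', '1959', '1960']
df.head()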

In [None]:
# Add a new column (e.g., total passengers)
df['Total'] = df[['1958', '1959', '1960']].sum(axis=1)
df.head()

In [None]:
# Delete a column
df = df.drop(columns=['Total'])
df.head()
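
Assigning to `df.columns` replaces every column name at once, so the list must match the number of columns exactly. As an alternative sketch, `rename()` changes only the columns you name and leaves the rest alone. The `demo` table and its column names are hypothetical.

```python
import pandas as pd

# Small illustrative table with short column names (invented for this example)
demo = pd.DataFrame({"temp": [21.5, 23.0], "hum": [60.0, 58.0]})

# rename() returns a new DataFrame; only the columns in the mapping are changed
renamed = demo.rename(columns={"temp": "temperature", "hum": "humidity"})

print(renamed.columns.tolist())
print(demo.columns.tolist())  # the original table is unchanged
```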

### What’s the Difference Between `loc[]` and `iloc[]`?
- `.loc[]` is label-based indexing. Use it when you know the **row label** or column name.
- `.iloc[]` is integer-location-based indexing. Use it when you know the **row or column position**.

**Examples:**
```python
df.loc[0]          # Gets the row with label/index 0
df.iloc[0, 1:3]    # Gets the first row, columns 2 and 3
```

In [None]:
# Access rows and columns using loc and iloc
print(df.loc[0])       # first row by label
print(df.iloc[0, 1:])  # first row, columns 2 to end
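
With the default integer index, `loc` and `iloc` can look identical. The sketch below uses text row labels so the difference is visible; the `demo` table is made up for illustration.

```python
import pandas as pd

# Illustrative table whose row labels are day names instead of 0, 1, 2
demo = pd.DataFrame(
    {"temperature": [21.5, 23.0, 19.0], "humidity": [60.0, 58.0, 70.0]},
    index=["mon", "tue", "wed"],
)

by_label = demo.loc["tue", "humidity"]  # row label "tue", column name "humidity"
by_position = demo.iloc[1, 1]           # second row, second column

# Label slices include the end label; position slices exclude the end position
label_slice = demo.loc["mon":"tue"]     # rows "mon" and "tue"
position_slice = demo.iloc[0:1]         # row at position 0 only

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Both lookups return the same cell here, but the two slices differ in length because of how each method treats the end of the range.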

## Understanding Rows, Columns, and Index in pandas
- A **row** represents one record or observation (e.g., a single measurement at a certain time).
- A **column** represents a variable or feature (e.g., temperature, humidity).
- The **index** is a label for each row. By default, it’s an integer (0, 1, 2, ...), but you can set it to any unique identifier (like a date or timestamp).

**Example:**
```python
df.index            # shows index values
df.columns          # shows column names
df.shape            # shows (number of rows, number of columns)
```

In [None]:
# Display index, columns, and shape
print('Index:', df.index)
print('Columns:', df.columns)
print('Shape:', df.shape)
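
The sketch below shows how the index changes when a column is promoted to row labels with `set_index`. The column names `obs` and `A` are borrowed from the strawberry file preview, with values rounded for illustration; the `demo` table itself is an assumption of this example.

```python
import pandas as pd

# Illustrative table (values rounded from the preview shown earlier)
demo = pd.DataFrame({
    "obs": [17, 18, 19],
    "A": [12.5, 12.0, 10.9],
})

print(demo.index)  # default integer index

# Use the observation number as the row label
indexed = demo.set_index("obs")

print(indexed.index)
print(indexed.shape)         # "obs" is now the index, so one column remains
print(indexed.loc[18, "A"])  # look up a row by its observation number
```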

## Working with Dates and Times in pandas
Sensor data is often recorded over time, so handling **datetime** is important.

**Steps to handle datetime:**
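- Convert a column to datetime: `df['column_name'] = pd.to_datetime(df['column_name'])`
- Set the datetime column as the index: `df = df.set_index('datetime_column')`
- Extract parts such as year, month, or hour with the `.dt` accessor (for a column) or directly from a datetime index (e.g., `df.index.month`)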

**Example:**

In [None]:
# Create monthly dates starting from Jan 1, 2022 ('MS' means month start)
df['Date'] = pd.date_range(start='2022-01-01', periods=len(df), freq='MS')
# Convert to datetime format (not strictly needed here since it's already datetime)
df['Date'] = pd.to_datetime(df['Date'])
# Set the datetime column as the index
df.set_index('Date', inplace=True)
# Extract the month number from the datetime index
df['Month_Num'] = df.index.month
df.head()
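
Real sensor files usually store timestamps as text, so the conversion step matters more than in the cell above. The following sketch applies the same steps to timestamps written like the `date` column in the strawberry file preview (for example `20230410 14:05:23`). The `demo` table and the explicit `format` string are assumptions of this example.

```python
import pandas as pd

# Timestamps stored as text, in the same layout as the preview shown earlier
demo = pd.DataFrame({
    "date": ["20230410 14:05:23", "20230410 14:20:44", "20230410 14:36:06"],
    "A": [12.50, 11.99, 10.91],
})

# Step 1: parse the text into datetime values (format spelled out explicitly)
demo["date"] = pd.to_datetime(demo["date"], format="%Y%m%d %H:%M:%S")

# Step 2: use the datetime column as the index
demo = demo.set_index("date")

# Step 3: extract parts of the timestamp from the datetime index
demo["hour"] = demo.index.hour
demo["minute"] = demo.index.minute

print(demo)
```

If the datetime values stay in a regular column instead of the index, the same parts are available through `.dt`, for example `demo["date"].dt.hour`.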

### Understanding Data Types in pandas
Each column in a DataFrame has a **data type** (dtype), such as:
- `int64`: integer numbers
- `float64`: decimal numbers
- `object`: text (string)
- `datetime64[ns]`: datetime

**Check datatypes:**
```python
df.dtypes
```
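
A common surprise is a numeric column that was read as text, which blocks calculations until it is converted. The sketch below checks the data types of a small made-up table and converts one such column with `pd.to_numeric`. The `label` and `co2_setpoint` columns are hypothetical.

```python
import pandas as pd

# Illustrative table; "co2_setpoint" holds numbers stored as text
demo = pd.DataFrame({
    "obs": [17, 18],
    "A": [12.5, 11.99],
    "label": ["leaf1", "leaf2"],
    "co2_setpoint": ["400", "400"],
})

print(demo.dtypes)  # inspect the data type of every column

# Convert the text column to numbers so it can be used in calculations
demo["co2_setpoint"] = pd.to_numeric(demo["co2_setpoint"])

print(demo.dtypes)
print(demo["co2_setpoint"].sum())
```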

### Resampling Data
If your data is recorded at a higher frequency (e.g., hourly), you can **resample** it to daily, weekly, or monthly summaries.

**Example: Get monthly mean values**

In [None]:
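# Example resampling (assumes a datetime index); 'MS' groups rows by calendar month
# numeric_only=True skips text columns such as 'Month', which cannot be averaged
monthly_avg = df.resample('MS').mean(numeric_only=True)
monthly_avg.head()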
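
Resampling is easier to follow on data that really is high frequency. The sketch below builds two days of hourly readings and reduces them to one row per day. The `hourly` table and its temperature values are invented for illustration.

```python
import pandas as pd

# 48 hourly timestamps covering two full days (hypothetical readings)
timestamps = pd.date_range(start="2023-04-10", periods=48, freq="60min")
hourly = pd.DataFrame(
    {"temperature": [20.0] * 24 + [22.0] * 24},
    index=timestamps,
)

# Resample to daily frequency ('D') and summarize each day
daily_mean = hourly.resample("D").mean()
daily_max = hourly.resample("D").max()

print(daily_mean)
print(daily_max)
```

Other summaries such as `.min()` or `.sum()` can be used in place of `.mean()`, depending on what the daily value should represent.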

## 3. Plotting with matplotlib

In [None]:
# Run this cell if matplotlib is not already installed
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt

# Basic line plot: one line per year
# (listing y explicitly leaves out helper columns such as 'Month_Num')
df.plot(x='Month', y=['1958', '1959', '1960'])
plt.title('Air Travel by Month')
plt.xlabel('Month')
plt.ylabel('Passengers')
plt.grid(True)
plt.show()
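
The cell above relies on the DataFrame's built-in `plot` method. As an alternative sketch, the same kind of figure can be drawn with matplotlib's `Axes` object, which gives direct control over color, line style, and labels. The `demo` table and its values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative readings (invented values)
demo = pd.DataFrame({
    "hour": [9, 10, 11, 12, 13],
    "temperature": [21.0, 23.5, 26.0, 27.5, 28.0],
})

# Create a figure and a single set of axes
fig, ax = plt.subplots(figsize=(6, 3))

# Draw a dashed red line with circular markers
ax.plot(
    demo["hour"],
    demo["temperature"],
    color="tab:red",
    linestyle="--",
    marker="o",
    label="Temperature",
)

ax.set_title("Temperature over time (illustrative data)")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Temperature (°C)")
ax.grid(True)
ax.legend()
fig.tight_layout()
plt.show()
```

Working with `ax` directly becomes more useful once a figure contains several subplots, because each one has its own `Axes` object.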

## 5. Other Plotting Styles

In [None]:
# Scatter plot example
df.plot(kind='scatter', x='1958', y='1959')
plt.title('Scatter Plot: 1958 vs 1959')
plt.grid(True)
plt.show()

In [None]:
# Bar chart example
df.set_index('Month')[['1958', '1959', '1960']].plot(kind='bar')
plt.title('Monthly Air Travel by Year')
plt.ylabel('Passengers')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Error bar example: mean and standard deviation of each year's monthly values
means = df[['1958', '1959', '1960']].mean()
stds = df[['1958', '1959', '1960']].std()
plt.errorbar(means.index, means.values, yerr=stds.values, fmt='o', capsize=5)
plt.title('Average Monthly Passengers with Std Dev')
plt.grid(True)
plt.show()

## ✍️ Exercise: Try it Yourself
- Replace the dataset with one of your own (e.g., environmental sensor logs)
- Try plotting temperature or humidity data over time
- Customize the title, axis labels, and line style

## Summary
In this lab, you learned how to:
- Read CSV and Excel data using `pandas`
- Explore and summarize datasets
- Plot basic graphs using `matplotlib`

In the next lab, we'll explore data cleaning and simple analysis techniques.