# 2D Plotting with Matplotlib
## Milestone 1: Extract Correct Data

### 1. Import libraries

In [1]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

In [2]:
print("Pandas", pd.__version__)
print("Matplotlib", mpl.__version__)

Pandas 2.2.2
Matplotlib 3.9.2


In [3]:
# Define the location of the data folder 
data_loc = "./data/"

### 2. Read data 
from the SeoulBikeData.csv file (here, a copy stored in the data folder) into a pandas DataFrame.

In [4]:
# Read the Seoul datafile
# Parse the `Date` column as a real datetime for reliable filtering later
# Handle non-ASCII characters via `encoding='unicode_escape'`
data = pd.read_csv(data_loc+"SeoulBikeData.csv",  
                    encoding = 'unicode_escape',   
                    parse_dates=['Date'],
                    date_format = "%d/%m/%Y"
                  )
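Because the file stores dates as day/month/year, the `date_format` argument decides whether a value such as `08/12/2017` is read as 8 December or 12 August. The sketch below illustrates this on a hypothetical two-row sample held in memory (the `csv_text` string and the `sample` name are invented for illustration, not part of the dataset). It assumes pandas 2.0 or later, where `date_format` is available.

```python
import io
import pandas as pd

# Hypothetical sample in the same day/month/year layout as the bike file
csv_text = "Date,Hour\n08/12/2017,8\n12/08/2018,9\n"

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],
    date_format="%d/%m/%Y",  # day first, then month
)

# "08/12/2017" is parsed as 8 December 2017, "12/08/2018" as 12 August 2018
print(sample["Date"].dt.month.tolist())  # [12, 8]
```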

The **`pandas.DataFrame.info()`** method provides a concise summary of a DataFrame, acting like a quick “report card” for your dataset.    
It displays:    
- the number of rows and columns
- the column names
- each column’s data type (e.g., integer, float, object/string, datetime)
- the count of non-null (non-missing) values per column
- the approximate memory usage.    

This is particularly useful during exploratory data analysis: you can check for missing values, verify data types before analysis or visualization, and understand the overall structure of your dataset without printing all rows.

In short, `.info()` is a diagnostic tool that gives you a snapshot of your dataset’s health and readiness for further processing.
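As a minimal sketch of how `.info()` exposes missing values, the example below uses a hypothetical three-row DataFrame (`toy`) with one missing temperature. The summary is written to a string buffer via the `buf` argument so it can be inspected programmatically. `notna().sum()` gives the same non-null counts as a Series.

```python
import io
import pandas as pd

# Hypothetical toy data with one missing temperature value
toy = pd.DataFrame({"Hour": [0, 1, 2], "Temp": [-5.2, None, -6.0]})

buffer = io.StringIO()
toy.info(buf=buffer)         # write the summary to a string instead of stdout
summary = buffer.getvalue()
print(summary)               # the "Temp" row reports 2 non-null values out of 3

# The same non-null counts that info() reports, as a Series
non_null = toy.notna().sum()
```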

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Date                       8760 non-null   datetime64[ns]
 1   Rented Bike Count          8760 non-null   int64         
 2   Hour                       8760 non-null   int64         
 3   Temperature(°C)            8760 non-null   float64       
 4   Humidity(%)                8760 non-null   int64         
 5   Wind speed (m/s)           8760 non-null   float64       
 6   Visibility (10m)           8760 non-null   int64         
 7   Dew point temperature(°C)  8760 non-null   float64       
 8   Solar Radiation (MJ/m2)    8760 non-null   float64       
 9   Rainfall(mm)               8760 non-null   float64       
 10  Snowfall (cm)              8760 non-null   float64       
 11  Seasons                    8760 non-null   object        
 12  Holida

In [6]:
pd.Timestamp.today()

Timestamp('2025-08-25 18:03:48.887252')

### 3. Do preliminary data exploration on your DataFrame.

The **`pandas.DataFrame.shape`** attribute (note: it’s not a method, so you don’t call it with parentheses) gives you the dimensions of your DataFrame as a tuple `(rows, columns)`.     
The first number tells you how many rows (observations) your dataset has, and the second tells you how many columns (features/variables).
For example, a shape of `(365, 10)` means 365 rows and 10 columns.    
This is especially useful for quickly confirming that you’ve loaded the dataset correctly, checking the size after filtering or subsetting, and verifying that your data transformations had the intended effect. In short, `.shape` gives you an at-a-glance view of your dataset’s size, which is often the very first check analysts perform before diving deeper.
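A small sketch of using `.shape` to confirm the effect of a filter, on a hypothetical 24-row DataFrame (one row per hour; the `toy` and `daytime` names are illustrative only):

```python
import pandas as pd

# Hypothetical data: one row for each hour of a single day
toy = pd.DataFrame({"Hour": range(24), "Count": range(100, 124)})

n_rows, n_cols = toy.shape           # the tuple unpacks into rows and columns

# Keep hours 8 to 20 inclusive and compare the shapes before and after
daytime = toy[toy["Hour"].between(8, 20)]
print(toy.shape, daytime.shape)      # (24, 2) (13, 2)
```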

In [7]:
# Check the number of rows and columns in your data
data.shape

(8760, 14)

The **`pandas.DataFrame.columns`** attribute returns an index object containing the labels (names) of all the columns in your DataFrame. In other words, it tells you *what variables you have available* in your dataset. For example, the dataset of bike rentals shows `Index(['Date',  'Rented Bike Count', 'Hour', 'Temperature(°C)', ...], dtype='object')`. This is especially useful when you want to:

* Verify that the dataset has the expected columns,
* Spot issues such as typos, extra spaces, or unusual characters in column names,
* Know the exact spelling and case of a column before selecting or filtering it.

In short, `.columns` is your quick reference list of all variable names in a DataFrame, helping you orient yourself before performing selections, transformations, or visualizations.
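To illustrate the point about stray spaces, here is a hedged sketch on a hypothetical DataFrame (`messy`) whose column names carry leading or trailing whitespace. Stripping them with `.str.strip()` is one common clean-up; the bike dataset itself does not appear to need it.

```python
import pandas as pd

# Hypothetical column names with accidental leading/trailing spaces
messy = pd.DataFrame({" Date": ["2017-12-01"], "Hour ": [0]})
print(list(messy.columns))    # [' Date', 'Hour ']

# Remove surrounding whitespace from every column label
messy.columns = messy.columns.str.strip()
print(list(messy.columns))    # ['Date', 'Hour']
```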


In [8]:
# Check the column names
data.columns

Index(['Date', 'Rented Bike Count', 'Hour', 'Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons',
       'Holiday', 'Functioning Day'],
      dtype='object')

The **`pandas.DataFrame.head()`** method displays the first few rows of your DataFrame (by default, the first 5). It’s like peeking at the top of your dataset to make sure everything looks right. You can also pass an integer to see a different number of rows, for example, `.head(10)` to see the first 10.    
This is especially helpful right after loading a dataset, because it lets you quickly check whether the file was read correctly, confirm that dates, numbers, and text were parsed properly, and get an immediate sense of what the data looks like without overwhelming yourself with thousands of rows. In short, `.head()` is your go-to tool for a quick first glance at the contents of your DataFrame.
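A brief sketch of the default and the explicit row count, using a hypothetical 24-row DataFrame (`toy`):

```python
import pandas as pd

# Hypothetical data: one row for each hour of a day
toy = pd.DataFrame({"Hour": range(24)})

default_rows = toy.head()     # first 5 rows by default
first_three = toy.head(3)     # first 3 rows when an integer is passed

print(len(default_rows), len(first_three))  # 5 3
```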


In [9]:
# Check the contents of the first 5 rows of the DataFrame
data.head()

| | Date | Rented Bike Count | Hour | Temperature(°C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Dew point temperature(°C) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) | Seasons | Holiday | Functioning Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-12-01 | 254 | 0 | -5.2 | 37 | 2.2 | 2000 | -17.6 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 1 | 2017-12-01 | 204 | 1 | -5.5 | 38 | 0.8 | 2000 | -17.6 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 2 | 2017-12-01 | 173 | 2 | -6.0 | 39 | 1.0 | 2000 | -17.7 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 3 | 2017-12-01 | 107 | 3 | -6.2 | 40 | 0.9 | 2000 | -17.6 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 4 | 2017-12-01 | 78 | 4 | -6.0 | 36 | 2.3 | 2000 | -18.6 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |


In [10]:
# Show the DataFrame information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Date                       8760 non-null   datetime64[ns]
 1   Rented Bike Count          8760 non-null   int64         
 2   Hour                       8760 non-null   int64         
 3   Temperature(°C)            8760 non-null   float64       
 4   Humidity(%)                8760 non-null   int64         
 5   Wind speed (m/s)           8760 non-null   float64       
 6   Visibility (10m)           8760 non-null   int64         
 7   Dew point temperature(°C)  8760 non-null   float64       
 8   Solar Radiation (MJ/m2)    8760 non-null   float64       
 9   Rainfall(mm)               8760 non-null   float64       
 10  Snowfall (cm)              8760 non-null   float64       
 11  Seasons                    8760 non-null   object        
 12  Holida

### 4. Extract only the specific columns
["Date", "Hour", "Rented Bike Count"] from the DataFrame.

The **`pandas.DataFrame.loc[]`** indexer is used to **select rows and columns by their labels (names)** rather than by numerical positions. It works with explicit labels, boolean conditions, and even slices. For example:

* `df.loc[5]` → retrieves the row with index label `5`.
* `df.loc[:, ["Date", "Hour"]]` → selects all rows but only the *Date* and *Hour* columns.
* `df.loc[df["Hour"] >= 8]` → filters rows where the *Hour* column is greater than or equal to 8.
* `df.loc[(df["Date"] == "2017-12-08") & (df["Hour"].between(8, 20)), ["Date", "Hour", "Rented Bike Count"]]` → retrieves only rows matching both conditions and only the columns *Date*, *Hour*, and *Rented Bike Count*.

This makes `.loc[]` one of the most powerful tools in pandas, because it lets you extract exactly the subset of data you want in a very readable way. In short, `.loc[]` is your “label-based scalpel” for slicing and filtering DataFrames with precision.
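The selection patterns listed above can be tried on a small, hypothetical DataFrame (`toy`, with invented counts) before applying them to the full dataset. This is a sketch of combining a row condition with a column list in a single `.loc[]` call:

```python
import pandas as pd

# Hypothetical rows spanning two dates; the counts are invented
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-07"] * 2 + ["2017-12-08"] * 3),
    "Hour": [7, 8, 7, 8, 21],
    "Rented Bike Count": [10, 20, 30, 40, 50],
})

# Rows: 8 December 2017 AND hour between 8 and 20; columns: two by label
subset = toy.loc[
    (toy["Date"] == "2017-12-08") & toy["Hour"].between(8, 20),
    ["Hour", "Rented Bike Count"],
]
print(subset)  # only the row with index label 3 (Hour 8, count 40) remains
```

Note that the selected row keeps its original index label (`3`), which is why filtered DataFrames often show non-contiguous index values.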


In [11]:
# Assign the extracted columns to a new DataFrame
all_bikes = data.loc[:,["Date", "Hour", "Rented Bike Count"]]

In [12]:
all_bikes.head()

| | Date | Hour | Rented Bike Count |
|---|---|---|---|
| 0 | 2017-12-01 | 0 | 254 |
| 1 | 2017-12-01 | 1 | 204 |
| 2 | 2017-12-01 | 2 | 173 |
| 3 | 2017-12-01 | 3 | 107 |
| 4 | 2017-12-01 | 4 | 78 |


The **`pandas.DataFrame.to_csv()`** method is used to save a DataFrame as a CSV (comma-separated values) file on your computer or another location. By default, it writes the entire DataFrame, including the index, into a text file where each row corresponds to a line and each column is separated by a comma. You can customize its behavior with several useful arguments:

* `index=False` → prevents pandas from writing the row index as an extra column.
* `sep=";"` → changes the delimiter (useful for European CSV formats that prefer semicolons).
* `encoding="utf-8"` or `"latin-1"` → controls how special characters are saved.
* `columns=[...]` → lets you choose only specific columns to export.

Example:

```python
df.to_csv("cleaned_data.csv", index=False, encoding="utf-8")
```

This saves your DataFrame to a file named `cleaned_data.csv` without adding the index column.

In short, `.to_csv()` is your go-to method for exporting processed or cleaned datasets so you can share them, reload them later, or use them in other tools like Excel.
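To see what the `index` and `sep` arguments change, the sketch below calls `.to_csv()` without a file path, in which case pandas returns the CSV text as a string. The `toy` DataFrame is a hypothetical two-row example.

```python
import io
import pandas as pd

# Hypothetical two-row DataFrame
toy = pd.DataFrame({"Hour": [8, 9], "Rented Bike Count": [780, 395]})

with_index = toy.to_csv()                        # index written as an unnamed first column
without_index = toy.to_csv(index=False)          # index omitted
semicolon = toy.to_csv(index=False, sep=";")     # semicolon-delimited

print(with_index.splitlines()[0])      # ,Hour,Rented Bike Count
print(without_index.splitlines()[0])   # Hour,Rented Bike Count
print(semicolon.splitlines()[1])       # 8;780

# Round trip: reading the text back reproduces the original DataFrame
reloaded = pd.read_csv(io.StringIO(without_index))
```

The empty first header in `with_index` is what shows up as an `Unnamed: 0` column when such a file is read back, which is one reason to pass `index=False`.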


In [13]:
all_bikes.to_csv(data_loc+"all_bikes.csv", index = False)

### 5. Extract only the required rows
those pertaining to the date we want to plot (2017-12-08) and to specific hours on that date, for example, 8 a.m. to 8 p.m., from the DataFrame you created in Step 4.

In [14]:
# Create a boolean mask (a series of True/False values)
# It checks each row in the Date column to see if it matches "2017-12-08".
# Rows with that date get True; all others get False.
mask_date = (all_bikes.Date == "2017-12-08")
# Create another boolean mask
# It checks if the value in the Hour column is between 8 and 20 inclusive
# The & operator combines two conditions: greater than or equal to 8 and less than or equal to 20.
mask_hour = (all_bikes.Hour >= 8) & (all_bikes.Hour <= 20)
# The two masks are combined with & (logical AND).
# Only rows where both conditions are True are selected.
# The result is assigned to bikes, which is a filtered DataFrame containing:
# Only rows from December 8, 2017 and only the hours between 08:00 and 20:00
bikes = all_bikes[mask_date & mask_hour]
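One way to sanity-check masks like these is to sum them, since `True` counts as 1. The sketch below does so on hypothetical hourly data for two days (`toy`, built with `pd.date_range`) and uses `Series.between()` as an equivalent way to write the hour condition. The `.copy()` call is an optional addition, not part of the cell above.

```python
import pandas as pd

# Hypothetical hourly data covering 7 and 8 December 2017 (48 rows)
stamps = pd.date_range("2017-12-07", periods=48, freq="h")
toy = pd.DataFrame({"Date": stamps.normalize(), "Hour": stamps.hour})

mask_date = toy["Date"] == "2017-12-08"
mask_hour = toy["Hour"].between(8, 20)    # inclusive on both ends by default
combined = mask_date & mask_hour

# Summing a boolean mask gives the number of rows it selects
print(mask_date.sum(), mask_hour.sum(), combined.sum())  # 24 26 13

# .copy() makes the filtered result independent of `toy`
selected = toy[combined].copy()
```

Thirteen selected rows is what hours 8 through 20 inclusive should give for a single day.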

In [15]:
bikes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, 176 to 188
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date               13 non-null     datetime64[ns]
 1   Hour               13 non-null     int64         
 2   Rented Bike Count  13 non-null     int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 416.0 bytes


In [16]:
bikes

| | Date | Hour | Rented Bike Count |
|---|---|---|---|
| 176 | 2017-12-08 | 8 | 780 |
| 177 | 2017-12-08 | 9 | 395 |
| 178 | 2017-12-08 | 10 | 261 |
| 179 | 2017-12-08 | 11 | 310 |
| 180 | 2017-12-08 | 12 | 355 |
| 181 | 2017-12-08 | 13 | 354 |
| 182 | 2017-12-08 | 14 | 350 |
| 183 | 2017-12-08 | 15 | 362 |
| 184 | 2017-12-08 | 16 | 401 |
| 185 | 2017-12-08 | 17 | 500 |


In [17]:
bikes.to_csv(data_loc+"bikes.csv", index = False)