# Introduction to Data Processing With Python 



- Introduction
- [Read CSV Data](#read-csv-data)
    - Import `pandas`
    - Read CSV data with `pandas`
    - Inspect a `pandas` data frame
    - Add parameters to read data properly
    - Rename columns/variables
    - Exercise
- [Tidy Data](#tidy-data) @Aurora
    - Observations and variables
    - Melt messy data to create tidy data
    - Visualizations
    - Exercise
- [Process Data](#process-data) 
    - Handle missing values 
    - Select variables
    - Combine variables
    - Filter observations (rows)
    - Sort observations (rows)
    - Select variables (columns)
    - Exercise
    - Assign new variables (columns) @Aurora
    - Custom lambda functions on assign @Aurora
    - Exercise @Aurora
- [Aggregate Data](#aggregate-data)
    - Date columns
    - Group by common values
    - Aggregations: sum, mean, first, median, count
    - Looping through large datasets
    - Exercise
- [Method piping](#todo-method-piping)
    - Piping example @Aurora
- [Combine Data Tables](#combine-data-tables)
    - Append tables of similar data
    - Exercise
    - Join tables with common variables
    - Exercise 
- [Sharing Insights](#self-study---sharing-insights) @Aurora
    - Mess up data for presentation with pivot
    - Save to CSV
    - More visualizations

## Read CSV Data

### Importing packages

In [None]:
import pandas as pd

### Read CSV data with pandas

In [None]:
pd.read_csv("../data/kap1.csv")

### Inspect pandas data frames

In [None]:
pd.read_csv("../data/kap1.csv").info()

### Add parameters to read data properly

In [None]:
pd.read_csv("../data/kap1.csv", header=5)

In [None]:
pd.read_csv("../data/kap1.csv", header=4)

In [None]:
budget = pd.read_csv("../data/kap1.csv", header=4)

In [None]:
budget.info()

In [None]:
budget.loc[0]

In [None]:
# Raises KeyError: the rows are still labeled 0, 1, 2, ... until we set index_col
budget.loc["Norge"]

In [None]:
pd.read_csv("../data/kap1.csv", header=4, index_col=0)

In [None]:
budget = pd.read_csv("../data/kap1.csv", header=4, index_col=0)

In [None]:
budget.info()

In [None]:
budget.describe()

In [None]:
budget.loc["Norge"]

In [None]:
# Raises KeyError: 0 is no longer a row label, use .iloc for positions
budget.loc[0]

In [None]:
budget.iloc[0]

In [None]:
budget.Budsjettiltak

In [None]:
# Raises SyntaxError: attribute access does not work for column names with spaces
budget.Lån og garantier

In [None]:
budget["Lån og garantier"]

In [None]:
budget.loc[:, "Lån og garantier"]

### Rename columns

In [None]:
pd.read_csv("../data/kap1.csv", header=4, index_col=0).rename(
    columns={"Budsjettiltak": "tiltak", "Lån og garantier": "lån"}
)

In [None]:
budget = pd.read_csv(
    "../data/kap1.csv", header=4, index_col=0
).rename(columns={"Budsjettiltak": "tiltak", "Lån og garantier": "lån"})

### Exercise

Read data from the file `../data/driftsinntekter.csv` with `pandas`. Which parameters do you need to specify? Use the [`pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to look up available parameters.

In [None]:
pd.read_csv("../data/driftsinntekter.csv", header=1)

## Tidy Data

### Observations and variables

Hadley Wickham introduced the term **tidy data** (<https://tidyr.tidyverse.org/articles/tidy-data.html>). Data tidying is a way to **structure DataFrames to facilitate analysis**.

A DataFrame is tidy if:

- Each variable is a column
- Each observation is a row
- Each DataFrame contains one type of observational unit

Note that tidy data principles are closely tied to normalization of relational databases.

In [None]:
income = pd.read_csv("../data/driftsinntekter.csv", header=1).rename(
    columns={"Category": "category"}
)
income

Is the `income` data frame tidy?

> No, _2019_, _2020_, and _2021_ are not variables. They are values of a _year_ variable.

### Melt messy datasets to tidy them

In [None]:
income.melt()

In [None]:
income.melt(id_vars=["category"])

In [None]:
income.melt(id_vars=["category"], var_name="year")

In [None]:
income.melt(id_vars=["category"], var_name="year", value_name="income")

In [None]:
income = (
    pd.read_csv("../data/driftsinntekter.csv", header=1)
    .rename(columns={"Category": "category"})
    .melt(id_vars=["category"], var_name="year", value_name="income")
)
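
To try the melt step without the data file, here is a small self-contained sketch. The categories and numbers are made up for illustration; only the `melt` call mirrors the cells above. Converting `year` to an integer is an optional extra step, not something the original cells do.

```python
import pandas as pd

# Made-up wide table: one column per year, so the years are hidden in the header
wide = pd.DataFrame(
    {
        "category": ["Sales", "Grants"],
        "2019": [100, 20],
        "2020": [110, 25],
    }
)

# Melt the year columns into a single "year" variable with one row per observation
tidy = wide.melt(id_vars=["category"], var_name="year", value_name="income")

# Optional: the melted years are strings taken from the header, so convert them
tidy = tidy.assign(year=lambda df: df["year"].astype(int))
```

Each (category, year) pair now appears exactly once, which is what makes the table tidy.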

### Visualizations

In [None]:
income.plot()

In [None]:
budget.plot()

In [None]:
budget.plot.barh()

### Exercise

Tidy the following data frame:

In [None]:
schedule = pd.DataFrame(
    {
        "hour": [19, 20, 21, 22],
        "NRK1": ["Dagsrevyen", "Beat for beat", "Nytt på nytt", "Lindmo"],
        "TV2": ["Kjære landsmenn", "Forræder", "21-nyhetene", "Farfar"],
        "TVNorge": [
            "The Big Bang Theory",
            "Alltid beredt",
            "Kongen befaler",
            "Praktisk info",
        ],
    }
)
schedule

In [None]:
schedule.melt(id_vars=["hour"], var_name="channel", value_name="program")

## Process Data

### Handle missing values

In [None]:
trips = pd.read_csv("../data/09.csv")
trips.info()

In [None]:
trips = pd.read_csv("../data/09.csv", na_values="-")
trips.info()

In [None]:
trips

In [None]:
trips = pd.read_csv("../data/09.csv").rename(columns={"duration": "rental_time"})

In [None]:
trips.dropna()

In [None]:
trips.fillna(0)
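
The effect of `na_values` is easiest to see on a tiny example. The sketch below uses a made-up CSV string in which a dash marks a missing duration; the data and variable names are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up data where "-" stands for a missing value
raw = """station,duration
A,120
B,-
C,300
"""

# Without na_values, the dash forces the whole column to be read as text
as_text = pd.read_csv(io.StringIO(raw))

# With na_values="-", the dash becomes NaN and the column stays numeric
as_numbers = pd.read_csv(io.StringIO(raw), na_values="-")

dropped = as_numbers.dropna()  # remove rows with missing values
filled = as_numbers.fillna(0)  # replace missing values with 0
```

Whether to drop or fill depends on the analysis: filling with 0 changes sums and means, while dropping changes row counts.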

### Select variables and observations

In [None]:
trips = pd.read_csv("../data/09.csv").fillna(0)


In [None]:
trips

In [None]:
trips.start_station_name

In [None]:
trips.end_station_name

In [None]:
trips["start_station_name"]

In [None]:
trips.loc[:, "start_station_name"]

In [None]:
trips.loc[4]

In [None]:
trips.loc[0:4]

In [None]:
trips.loc[[0, 4]]

In [None]:
trips.loc[[0, 4], "start_station_name"]

In [None]:
trips.loc[[0, 4], ["start_station_name", "end_station_name"]]

In [None]:
trips.iloc[4]

In [None]:
trips.iloc[4:9]

In [None]:
trips.iloc[5:8, 0]

In [None]:
trips.loc[5, "started_at"]

In [None]:
trips.loc[5, trips.columns[1]]

### Combine variables

In [None]:
trips.start_station_name + trips.end_station_name

In [None]:
trips.assign(combined_id=trips.start_station_id + trips.end_station_id)

### Filter observations

In [None]:
trips.query("duration > 20000")

In [None]:
trips.query("duration < 62")

In [None]:
trips.query("end_station_id >= start_station_id")

In [None]:
low_budget = budget[budget["tiltak"] > budget["lån"]]

In [None]:
asian_countries = ["India", "Russland", "Korea", "Kina", "Japan"]
asian_budgets = budget[budget.index.isin(asian_countries)]
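
The cells above filter in two styles: `query` strings and boolean masks. The sketch below, on a made-up `rides` table, suggests that the two styles select the same rows. The data and the name `wanted` are assumptions for illustration.

```python
import pandas as pd

# Made-up trips, using the same column names as the bike data
rides = pd.DataFrame(
    {
        "start_station_id": [1, 2, 3, 4],
        "end_station_id": [2, 2, 1, 9],
        "duration": [61, 500, 25000, 90],
    }
)

# The same filter written as a query string and as a boolean mask
via_query = rides.query("duration > 20000")
via_mask = rides[rides["duration"] > 20000]

# Membership tests: .isin() on a mask, or "in @name" inside a query
wanted = [2, 9]
via_isin = rides[rides["end_station_id"].isin(wanted)]
via_query_isin = rides.query("end_station_id in @wanted")
```

The `@` prefix tells `query` to look up `wanted` as a Python variable instead of a column name.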

### Sort observations

In [None]:
trips.sort_values(by="duration")

In [None]:
trips.sort_values(by=["duration", "start_station_id"])

In [None]:
trips.sort_index()

### Exercise

TODO: Something something bysykkel

### TODO - Assign new columns
    

### TODO - Custom lambda functions on assign

## Aggregate Data

### Bigger datasets

In [None]:
pd.read_csv("../data/09.csv")

In [None]:
trips = pd.read_csv("../data/09.csv")

In [None]:
trips.info()

### Date columns

In [None]:
trips = pd.read_csv("../data/09.csv", parse_dates=["started_at", "ended_at"])
trips.info()

### Group by common values

In [None]:
trips.groupby("start_station_name")

In [None]:
trips.groupby("start_station_name").size()

In [None]:
trips.groupby("start_station_name").size().sort_values()

In [None]:
trips.groupby("start_station_name").size().reset_index()

In [None]:
(
    trips.groupby("start_station_name")
    .size()
    .reset_index()
    .rename(columns={0: "num_trips"})
)

In [None]:
(
    trips.groupby("start_station_name")
    .size()
    .reset_index()
    .rename(columns={0: "num_trips"})
    .sort_values(by="num_trips")
)

In [None]:
(
    trips.groupby("end_station_name")
    .size()
    .reset_index()
    .rename(columns={0: "num_trips"})
    .sort_values(by="num_trips")
)

In [None]:
num_trips = (
    trips.groupby("start_station_name")
    .size()
    .reset_index()
    .rename(columns={0: "num_trips"})
    .sort_values(by="num_trips")
)

### Aggregations: sum, mean, median, first, count, ...

In [None]:
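# numeric_only=True skips text columns, which have no median (required in pandas 2.0 and later)
trips.groupby("start_station_name").median(numeric_only=True)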

In [None]:
trips.groupby("start_station_name").agg(median_duration=("duration", "median"))

In [None]:
# Sidenote, we could do the size example as follows
trips.groupby("start_station_name").agg(
    num_trips=("start_station_name", "size")
).reset_index().sort_values(by="num_trips")

In [None]:
trips.groupby("start_station_name").agg(
    median_duration=("duration", "median"),
    description=("start_station_description", "first"),
)

In [None]:
def most_common(column):
    return column.mode().iloc[0]


trips.groupby("start_station_name").agg(
    median_duration=("duration", "median"),
    description=("start_station_description", "first"),
    common_end_station=("end_station_name", most_common),
)

In [None]:
trips.groupby(["start_station_name", "end_station_name"]).agg(
    median_duration=("duration", "median")
)

In [None]:
trips.groupby(["start_station_name", "end_station_name"]).agg(
    median_duration=("duration", "median"),
    start_station_description=("start_station_description", "first"),
    end_station_description=("end_station_description", "first"),
)

In [None]:
trips.groupby(["start_station_name", "end_station_name"]).agg(
    median_duration=("duration", "median"),
    start_station_description=("start_station_description", "first"),
    end_station_description=("end_station_description", "first"),
).reset_index()
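
As a self-contained sketch of the named-aggregation pattern used above, the example below runs on a handful of made-up trips. The data is hypothetical; `most_common` is the same helper defined earlier in this section.

```python
import pandas as pd

# Made-up trips between three stations
rides = pd.DataFrame(
    {
        "start_station_name": ["A", "A", "A", "B", "B"],
        "end_station_name": ["B", "B", "C", "A", "A"],
        "duration": [100, 300, 200, 50, 150],
    }
)


def most_common(column):
    """Return the most frequent value in a column."""
    return column.mode().iloc[0]


# Each keyword names an output column: new_name=(input_column, aggregation)
summary = (
    rides.groupby("start_station_name")
    .agg(
        num_trips=("duration", "size"),
        median_duration=("duration", "median"),
        common_end_station=("end_station_name", most_common),
    )
    .reset_index()
)
```

The aggregation can be a string naming a built-in such as `"median"` or any function that reduces a column to a single value.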

### Looping through large datasets

In [None]:
for index, row in trips.iterrows():
    trips.at[index, 'end_station_coordinates'] = str(row['end_station_latitude']) + ', ' + str(row['end_station_longitude'])

In [None]:
trips['end_station_coordinates'] = trips['end_station_latitude'].astype(str) + ', ' + trips['end_station_longitude'].astype(str)

In [None]:
for i in range(len(trips)):
    trips.at[i, 'trip_duration_minutes'] = (trips.at[i, 'ended_at'] - trips.at[i, 'started_at']).total_seconds() / 60

In [None]:
trips['trip_duration_minutes'] = (trips['ended_at'] - trips['started_at']).dt.total_seconds() / 60

In [None]:
for i in range(len(trips)):
    trips.at[i, 'long_trip'] = trips.at[i, 'trip_duration_minutes'] > 30

In [None]:
trips['long_trip'] = [trip.trip_duration_minutes > 30 for trip in trips.itertuples()]
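
The comparison in the last pair of cells can also be written as a single vectorized expression, with no explicit loop at all. The sketch below uses two made-up trips to check that the loop and the vectorized form agree; the data is hypothetical.

```python
import pandas as pd

# Two made-up trips: one of 10 minutes and one of 45 minutes
rides = pd.DataFrame(
    {
        "started_at": pd.to_datetime(["2023-09-01 08:00", "2023-09-01 09:00"]),
        "ended_at": pd.to_datetime(["2023-09-01 08:10", "2023-09-01 09:45"]),
    }
)

# Vectorized: subtract whole columns, then compare the whole column at once
minutes = (rides["ended_at"] - rides["started_at"]).dt.total_seconds() / 60
rides = rides.assign(trip_duration_minutes=minutes, long_trip=minutes > 30)

# The same comparison written as an explicit Python loop, for reference
looped = [value > 30 for value in minutes]
```

On a table with many rows, the vectorized form is usually much faster than looping with `iterrows`, `itertuples`, or `range(len(...))`.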

### Exercise

You have a dataset for the month of September (`09.csv`) containing information about bike trips. Aggregate the data to find the total number of long trips (trips lasting more than 30 minutes) for each end station.

In [None]:
trips_sep = pd.read_csv("../data/09.csv", parse_dates=["started_at", "ended_at"])

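# Compute the trip length in minutes from the timestamps, then flag trips over 30 minutes
trips_sep['trip_duration_minutes'] = (trips_sep['ended_at'] - trips_sep['started_at']).dt.total_seconds() / 60
trips_sep['long_trip'] = trips_sep['trip_duration_minutes'] > 30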
long_trip_counts = trips_sep.groupby('end_station_name')['long_trip'].sum().reset_index()
long_trip_counts.columns = ['end_station_name', 'total_long_trips']

print(long_trip_counts)

## TODO: Method piping

## Combine Data Tables

We have two files with the same kinds of data: `08.csv` with data for August and `09.csv` with data for September. How can we combine them into one DataFrame?

In [None]:
trips_aug = pd.read_csv("../data/08.csv", parse_dates=["started_at", "ended_at"])
trips_sep = pd.read_csv("../data/09.csv", parse_dates=["started_at", "ended_at"])

### Append tables with similar data

In [None]:
pd.concat([trips_aug, trips_sep])

In [None]:
pd.concat([trips_aug, trips_sep]).reset_index()

In [None]:
pd.concat([trips_aug, trips_sep]).reset_index(drop=True)

In [None]:
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)

In [None]:
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)
    trips = pd.read_csv(filnavn, parse_dates=["started_at", "ended_at"])

In [None]:
trips.started_at

In [None]:
months = []
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)
    months.append(pd.read_csv(filnavn, parse_dates=["started_at", "ended_at"]))

In [None]:
months

In [None]:
months = []
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)
    months.append(pd.read_csv(filnavn, parse_dates=["started_at", "ended_at"]))
trips = pd.concat(months).reset_index(drop=True)

In [None]:
trips

In [None]:
import pathlib

pathlib.Path.cwd().parent / "data"

In [None]:
(pathlib.Path.cwd().parent / "data").glob("*.csv")

In [None]:
list((pathlib.Path.cwd().parent / "data").glob("*.csv"))

In [None]:
months = []
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)
    months.append(pd.read_csv(filnavn, parse_dates=["started_at", "ended_at"]))
trips = pd.concat(months).reset_index(drop=True)
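
The `pathlib` cells above suggest finding the files automatically instead of listing them by hand. One possible sketch follows, assuming the monthly files are named with two digits (`08.csv`, `09.csv`). A plain `"*.csv"` pattern would also pick up unrelated files such as `kap1.csv`, so the example uses a narrower pattern. It builds a temporary directory with made-up files so that it runs on its own.

```python
import pathlib
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    data_dir = pathlib.Path(tmp)

    # Made-up stand-ins for the real data files
    (data_dir / "08.csv").write_text("started_at,duration\n2023-08-01 08:00,100\n")
    (data_dir / "09.csv").write_text("started_at,duration\n2023-09-01 08:00,200\n")
    (data_dir / "kap1.csv").write_text("unrelated\n")

    # Match only two-digit file names, sorted so the months come in order
    month_files = sorted(data_dir.glob("[0-9][0-9].csv"))

    # Read every monthly file and stack the tables on top of each other
    months = [pd.read_csv(path, parse_dates=["started_at"]) for path in month_files]
    combined = pd.concat(months).reset_index(drop=True)
```

With the course data, `data_dir` would instead be `pathlib.Path.cwd().parent / "data"`.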

### Exercise

You have two datasets, one for August (08.csv) and another for September (09.csv). Each dataset contains information about bike trips. Your task is to combine these two datasets and find out the total number of bike trips for each station in these two months.

In [None]:
# Solution

# Read the data
trips_aug = pd.read_csv("../data/08.csv", parse_dates=["started_at", "ended_at"])
trips_sep = pd.read_csv("../data/09.csv", parse_dates=["started_at", "ended_at"])

# Combine the datasets
trips = pd.concat([trips_aug, trips_sep]).reset_index(drop=True)

# Count the number of trips for each station
station_counts = trips['start_station_name'].value_counts().reset_index()
station_counts.columns = ['start_station_name', 'trip_count']

print(station_counts)

### Join tables with common variables

In [None]:
num_trips = (
    trips.groupby("start_station_name")
    .size()
    .reset_index(name="num_trips")
    .sort_values(by="num_trips")
)
num_trips

In [None]:
trip_lengths = (
    trips.groupby("start_station_name")
    .agg(median_duration=("duration", "median"))
    .reset_index()
    .sort_values(by="median_duration")
)
trip_lengths

In [None]:
pd.merge(num_trips, trip_lengths)

In [None]:
num_trips_from = (
    trips.groupby("start_station_name")
    .agg(num_trips=("start_station_name", "size"))
    .sort_values(by="num_trips")
    .reset_index()
)
num_trips_from

In [None]:
num_trips_to = (
    trips.groupby("end_station_name")
    .agg(num_trips=("end_station_name", "size"))
    .sort_values(by="num_trips")
    .reset_index()
)
num_trips_to

In [None]:
pd.merge(num_trips_from, num_trips_to)

In [None]:
pd.merge(
    num_trips_from,
    num_trips_to,
    left_on="start_station_name",
    right_on="end_station_name",
)

In [None]:
popular_from = num_trips_from.nlargest(10, "num_trips")
popular_to = num_trips_to.nlargest(10, "num_trips")

In [None]:
pd.merge(
    popular_from, popular_to, left_on="start_station_name", right_on="end_station_name"
)

In [None]:
pd.merge(
    popular_from,
    popular_to,
    how="inner",
    left_on="start_station_name",
    right_on="end_station_name",
)

In [None]:
pd.merge(
    popular_from,
    popular_to,
    how="left",
    left_on="start_station_name",
    right_on="end_station_name",
)

In [None]:
pd.merge(
    popular_from,
    popular_to,
    how="right",
    left_on="start_station_name",
    right_on="end_station_name",
)

In [None]:
pd.merge(
    popular_from,
    popular_to,
    how="outer",
    left_on="start_station_name",
    right_on="end_station_name",
)
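
The four `how` options differ only in which unmatched rows they keep. The sketch below counts the resulting rows on two made-up tables that share two of their three stations. The data and the helper `join` are assumptions for illustration.

```python
import pandas as pd

# Made-up top lists: stations B and C appear in both, A and D in only one
popular_from = pd.DataFrame(
    {"start_station_name": ["A", "B", "C"], "num_trips": [30, 20, 10]}
)
popular_to = pd.DataFrame(
    {"end_station_name": ["B", "C", "D"], "num_trips": [25, 15, 5]}
)


def join(how):
    """Merge the two tables on station name with the given join type."""
    return pd.merge(
        popular_from,
        popular_to,
        how=how,
        left_on="start_station_name",
        right_on="end_station_name",
        suffixes=("_from", "_to"),
    )


# Number of rows kept by each join type
row_counts = {how: len(join(how)) for how in ["inner", "left", "right", "outer"]}
```

An inner join keeps only stations found in both tables, left and right joins keep every row from one side, and an outer join keeps every row from both sides, filling the gaps with missing values.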

### Exercise 2

Merge the `trips` DataFrame with the `stations` DataFrame to add the address and number of bike docks for each start station to the `trips` DataFrame. Merge on `start_station_id` in the `trips` DataFrame and `station_id` in the `stations` DataFrame.

In [None]:
stations = pd.DataFrame({
    'station_id': [564, 421, 621, 447, 430, 558, 424, 428],
    'station_name': ['Oscars gate', 'Alexander Kiellands Plass', 'Torshovdalen øst', 'Kværnerbyen', 'Spikersuppa Vest', 'Dokkveien', 'Birkelunden', 'Olav Kyrres plass'],
    'address': ['Oscars gate 1', 'Alexander Kiellands Plass 2', 'Torshovdalen øst 3', 'Kværnerbyen 4', 'Spikersuppa Vest 5', 'Dokkveien 6', 'Birkelunden 7', 'Olav Kyrres plass 8'],
    'num_docks': [10, 15, 12, 14, 16, 14, 15, 12]
})

In [None]:
# Solution

# Join start_station_id in trips against station_id in stations,
# bringing along only the columns we want to add
merged_trips = pd.merge(
    trips,
    stations[['station_id', 'address', 'num_docks']],
    how='left',
    left_on='start_station_id',
    right_on='station_id',
).drop(columns='station_id')

print(merged_trips.head())

### Exercise 1

What are the 10 most popular destinations from Alexander Kiellands Plass? Sort them from most rides to fewest.

In [None]:
# Solution

# Filter the DataFrame
filtered_trips = trips[trips['start_station_name'] == 'Alexander Kiellands Plass']

# Group by end_station_name and count, sort in descending order, take top 10
top_10_end_stations = filtered_trips.groupby('end_station_name').size().sort_values(ascending=False).head(10).reset_index(name='count')

print(top_10_end_stations)

## Self study - Sharing Insights

### Mess up data for presentation

In [None]:
from_to = (
    trips.groupby(["start_station_name", "end_station_name"])
    .agg(num_trips=("start_station_name", "size"))
    .reset_index()
    .sort_values(by="num_trips")
)

In [None]:
from_to.query(
    "start_station_name.isin(@popular_from.start_station_name) and end_station_name.isin(@popular_to.end_station_name)"
).pivot_table(
    index="start_station_name", columns="end_station_name", values="num_trips"
)
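
Pivoting is the reverse of melting: it spreads the values of one variable out into columns. The sketch below shows the idea on a made-up `from_to` table. The data is hypothetical, and the `fill_value=0` variant is an optional extra that the cell above does not use.

```python
import pandas as pd

# Made-up tidy table: one row per (start, end) pair that has trips
from_to = pd.DataFrame(
    {
        "start_station_name": ["A", "A", "B"],
        "end_station_name": ["A", "B", "A"],
        "num_trips": [5, 7, 3],
    }
)

# One row per start station, one column per end station
wide = from_to.pivot_table(
    index="start_station_name", columns="end_station_name", values="num_trips"
)

# Pairs without any trips show up as NaN; fill_value replaces them
filled = from_to.pivot_table(
    index="start_station_name",
    columns="end_station_name",
    values="num_trips",
    fill_value=0,
)
```

The wide table is easier to read as a matrix, but it is no longer tidy, so it is best kept as a final presentation step.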

### Save to CSV

In [None]:
from_to.to_csv("from_to.csv", index=False)

### More visualizations

In [None]:
from_to

In [None]:
num_trips_to = (
    trips.groupby("end_station_name")
    .agg(
        num_trips=("end_station_name", "size"),
        lat=("end_station_latitude", "first"),
        lon=("end_station_longitude", "first"),
    )
    .sort_values(by="num_trips")
    .reset_index()
)

In [None]:
import numpy as np

(
    pd.merge(
        num_trips_from,
        num_trips_to,
        left_on="start_station_name",
        right_on="end_station_name",
        suffixes=("_from", "_to"),
    )
    .assign(from_over_to=lambda df: np.log(df.num_trips_from / df.num_trips_to))
    .plot.scatter(x="lon", y="lat", c="from_over_to")
)