# Introduction to Data Processing With Python 



- Introduction
- [Read CSV Data](#read-csv-data)
    - Import `pandas`
    - Read CSV data with `pandas`
    - Bigger datasets
    - Inspect a `pandas` data frame
    - Rename columns/variables
    - Exercise
- [Tidy Data](#tidy-data) @Aurora
    - Observations and variables #TODO change to handle the city bike dataset instead of the Excel file
    - Melt messy data to create tidy data
    - Visualizations
    - Exercise
- [Process Data](#process-data) 
    - Handle missing values 
    - Select variables
    - Combine variables
    - Filter observations (rows)
    - Sort observations (rows)
    - Select variables (columns)
    - Exercise
    - Assign new variables (columns) @Aurora
    - Custom lambda functions on assign @Aurora
    - Exercise @Aurora
- [Aggregate Data](#aggregate-data)
    - Date columns
    - Method piping @Aurora
    - Group by common values
    - Aggregations: sum, mean, first, median, count
    - Exercise
- [Method piping](#todo-method-piping)
    - piping example @Aurora
- [Combine Data Tables](#combine-data-tables)
    - Append tables of similar data
    - Exercise
    - Join tables with common variables
    - Exercise 
- [Sharing Insights](#self-study---sharing-insights) @Aurora
    - Mess up data for presentation with pivot
    - Save to Excel (and other formats)
    - More visualizations

## Read CSV Data

### Importing packages

In [None]:
import pandas as pd

### Read CSV data with pandas

In [None]:
pd.read_csv("../data/09.csv")

### Inspect pandas data frames

In [None]:
pd.read_csv("../data/09.csv").info()

### Add parameters to read CSV data 

In [None]:
pd.read_csv("../data/09.csv", header=1)

In [None]:
pd.read_csv("../data/09.csv", skiprows=1)
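
The difference between `header=1` and `skiprows=1` is easier to see on a tiny file. The following is a minimal sketch, assuming a made-up CSV text held in memory (the title line and the column names are illustrative and not taken from `09.csv`):

```python
import io

import pandas as pd

# Hypothetical CSV text: one title line, then the real header, then data.
raw = "Bike trips export\nstation,trips\nAHO,3\nVulkan,5\n"

# header=1: use line 1 (zero-based) as column names; the line before it is dropped.
with_header = pd.read_csv(io.StringIO(raw), header=1)

# skiprows=1: skip the first line, then treat the next line as the header.
with_skiprows = pd.read_csv(io.StringIO(raw), skiprows=1)

print(with_header)
print(with_skiprows)
```

For a file laid out like this one, both calls should give the same table. Which of the two reads better is mostly a matter of taste.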

In [None]:
budget = pd.read_csv("../data/09.csv")

In [None]:
budget.info()

In [None]:
budget.loc[0]

In [None]:
budget.loc["2022-09-01 03:04:31.178000+00:00"]

In [None]:
budget = pd.read_csv("../data/09.csv", index_col=0)

In [None]:
budget.describe()

In [None]:
budget.loc[0]

In [None]:
budget.iloc[0]

In [None]:
budget.loc["2022-09-01 03:04:31.178000+00:00"]

In [None]:
budget.start_station_name

In [None]:
budget.loc[:, "start_station_name"]
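
The cells above mix `.loc` (label-based) and `.iloc` (position-based) lookups. As a small sketch of the difference, assuming a made-up table indexed by station name (the numbers are illustrative):

```python
import pandas as pd

# Hypothetical mini table indexed by station name (values are made up).
stations_demo = pd.DataFrame(
    {"trips": [3, 5, 2], "docks": [10, 15, 12]},
    index=["AHO", "Vulkan", "Torggata"],
)

by_label = stations_demo.loc["Vulkan"]        # row whose index label is "Vulkan"
by_position = stations_demo.iloc[1]           # second row, whatever its label is
one_cell = stations_demo.loc["AHO", "docks"]  # row label and column label

print(by_label)
print(one_cell)
```

With this index, `stations_demo.loc[0]` would raise a `KeyError` because no row is labelled `0`, while `stations_demo.iloc[0]` still returns the first row.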

In [None]:
budget = pd.read_csv("../data/09.csv", index_col=0).rename(
    columns={"duration": "trip_duration"}
)

### Alternative: kap1 as a CSV file

In [None]:
pd.read_csv("../data/kap1.csv", header=5)

In [None]:
pd.read_csv("../data/kap1.csv", header=4)

In [None]:
budget = pd.read_csv("../data/kap1.csv", header=4)

In [None]:
budget.info()

In [None]:
budget.loc[0]

In [None]:
budget.loc["Norge"]

In [None]:
pd.read_csv("../data/kap1.csv", header=4, index_col=0)

In [None]:
budget = pd.read_csv("../data/kap1.csv", header=4, index_col=0)

In [None]:
budget.info()

In [None]:
budget.describe()

In [None]:
budget.loc["Norge"]

In [None]:
budget.loc[0]

In [None]:
budget.iloc[0]

In [None]:
budget.Budsjettiltak

In [None]:
# Raises SyntaxError: attribute access does not work for column names with spaces
budget.Lån og garantier

In [None]:
budget["Lån og garantier"]

In [None]:
budget.loc[:, "Lån og garantier"]

In [None]:
pd.read_csv("../data/kap1.csv", header=4, index_col=0).rename(
    columns={"Budsjettiltak": "tiltak", "Lån og garantier": "lån"}
)

In [None]:
budget = pd.read_csv(
    "../data/kap1.csv", header=4, index_col=0
).rename(columns={"Budsjettiltak": "tiltak", "Lån og garantier": "lån"})

### Exercise

Read data from the file `"../data/driftsinntekter.csv"` with `pandas`. Which parameters do you need to specify? Use the [`pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to look up available parameters.

In [None]:
pd.read_csv("../data/driftsinntekter.csv", header=1)

## Tidy Data

### Observations and variables

Hadley Wickham introduced the term **tidy data** (<https://tidyr.tidyverse.org/articles/tidy-data.html>). Data tidying is a way to **structure DataFrames to facilitate analysis**.

A DataFrame is tidy if:

- Each variable is a column
- Each observation is a row
- Each DataFrame contains one observational unit

Note that tidy data principles are closely tied to normalization of relational databases.

Is the dataframe below in a tidy format?

In [59]:
income = pd.read_excel("../data/driftsinntekter-2021.xlsx", header=1).rename(
    columns={"Category": "category"}
)
income

Unnamed: 0,category,2019,2020,2021
0,Basistilskudd,216,203,202
1,Basistilskudd personell og kompetanse,-,-,3
2,Forvaltningsmidler,307.7,196.9,-
3,Tilskudd H.U. Sverdrup II,284.3,274.2,200.1
4,Tilskudd Kjemiidentifikasjonslab,107.1,55.7,88.8
5,CBRNE-satsningen,-,-,85
6,Tilskudd og overføringer,389.4,439.1,341.8
7,Strategiske FoU-midler,2131.1,2109,2159
8,Konsept- og forprosjekteringsmidler,838.2,696.2,635.1
9,Forsvarssektoren,3759,3988,5180


Is the `income` data frame tidy?

> No, _2019_, _2020_, and _2021_ are not variables. They are values of a _year_ variable

### Melt messy datasets to tidy them

In [60]:
income.melt()

Unnamed: 0,variable,value
0,category,Basistilskudd
1,category,Basistilskudd personell og kompetanse
2,category,Forvaltningsmidler
3,category,Tilskudd H.U. Sverdrup II
4,category,Tilskudd Kjemiidentifikasjonslab
5,category,CBRNE-satsningen
6,category,Tilskudd og overføringer
7,category,Strategiske FoU-midler
8,category,Konsept- og forprosjekteringsmidler
9,category,Forsvarssektoren


In [61]:
income.melt(id_vars=["category"])

Unnamed: 0,category,variable,value
0,Basistilskudd,2019,216
1,Basistilskudd personell og kompetanse,2019,-
2,Forvaltningsmidler,2019,307.7
3,Tilskudd H.U. Sverdrup II,2019,284.3
4,Tilskudd Kjemiidentifikasjonslab,2019,107.1
5,CBRNE-satsningen,2019,-
6,Tilskudd og overføringer,2019,389.4
7,Strategiske FoU-midler,2019,2131.1
8,Konsept- og forprosjekteringsmidler,2019,838.2
9,Forsvarssektoren,2019,3759


In [62]:
income.melt(id_vars=["category"], var_name="year")

Unnamed: 0,category,year,value
0,Basistilskudd,2019,216
1,Basistilskudd personell og kompetanse,2019,-
2,Forvaltningsmidler,2019,307.7
3,Tilskudd H.U. Sverdrup II,2019,284.3
4,Tilskudd Kjemiidentifikasjonslab,2019,107.1
5,CBRNE-satsningen,2019,-
6,Tilskudd og overføringer,2019,389.4
7,Strategiske FoU-midler,2019,2131.1
8,Konsept- og forprosjekteringsmidler,2019,838.2
9,Forsvarssektoren,2019,3759


In [63]:
income.melt(id_vars=["category"], var_name="year", value_name="income")

Unnamed: 0,category,year,income
0,Basistilskudd,2019,216
1,Basistilskudd personell og kompetanse,2019,-
2,Forvaltningsmidler,2019,307.7
3,Tilskudd H.U. Sverdrup II,2019,284.3
4,Tilskudd Kjemiidentifikasjonslab,2019,107.1
5,CBRNE-satsningen,2019,-
6,Tilskudd og overføringer,2019,389.4
7,Strategiske FoU-midler,2019,2131.1
8,Konsept- og forprosjekteringsmidler,2019,838.2
9,Forsvarssektoren,2019,3759


In [67]:
# This could potentially be a piping exercise.

income = (
    pd.read_csv("../data/driftsinntekter.csv", header=1)
    .rename(columns={"Category": "category"})
    .melt(id_vars=["category"], var_name="year", value_name="income")
)
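
After melting, the `year` and `income` columns may still hold text, since the year values come from column names and the table uses `-` for missing amounts. A possible follow-up step is sketched below on made-up values in the same wide layout (the categories `A` and `B` are placeholders):

```python
import pandas as pd

# Made-up values in the same wide layout as the income table.
wide = pd.DataFrame(
    {"category": ["A", "B"], "2019": ["216", "-"], "2020": ["203", "3"]}
)

tidy = wide.melt(id_vars=["category"], var_name="year", value_name="income").assign(
    year=lambda d: d["year"].astype(int),
    # errors="coerce" turns anything non-numeric, such as "-", into NaN.
    income=lambda d: pd.to_numeric(d["income"], errors="coerce"),
)

print(tidy)
print(tidy.dtypes)
```

Passing `na_values="-"` when reading the file is another way to get the same effect for the income values.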

### Exercise

Use `pd.melt` to tidy the following data frame.
Set `var_name` to "channel" and `value_name` to "program".


In [None]:
schedule = pd.DataFrame(
    {
        "hour": [19, 20, 21, 22],
        "NRK1": ["Dagsrevyen", "Beat for beat", "Nytt på nytt", "Lindmo"],
        "TV2": ["Kjære landsmenn", "Forræder", "21-nyhetene", "Farfar"],
        "TVNorge": [
            "The Big Bang Theory",
            "Alltid beredt",
            "Kongen befaler",
            "Praktisk info",
        ],
    }
)
schedule

In [None]:
schedule.melt(id_vars=["hour"], var_name="channel", value_name="program")

## Process Data

### Handle missing values

In [None]:
trips = pd.read_csv("../data/09.csv")
trips.info()

In [None]:
trips = pd.read_csv("../data/09.csv", na_values="-")
trips.info()

In [None]:
trips

In [None]:
trips = pd.read_csv("../data/09.csv").rename(columns={"duration": "rental_time"})

In [None]:
trips.dropna()

In [None]:
trips.fillna(0)
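
`dropna` and `fillna` both accept arguments that limit them to specific columns. A minimal sketch, assuming a made-up CSV text where `-` marks a missing value:

```python
import io

import pandas as pd

# Hypothetical CSV where "-" marks a missing value.
raw = "station,trips,docks\nAHO,3,10\nVulkan,-,15\nTorggata,2,-\n"

df_na = pd.read_csv(io.StringIO(raw), na_values="-")

only_complete = df_na.dropna()              # drop rows with any missing value
has_trips = df_na.dropna(subset=["trips"])  # only require "trips" to be present
filled = df_na.fillna({"trips": 0})         # fill one column, leave the rest as NaN

print(only_complete)
print(has_trips)
print(filled)
```

Whether to drop or fill depends on the analysis: filling with 0 changes means and medians, while dropping rows discards the other values on those rows.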

### Select variables and observations

In [None]:
trips = pd.read_csv("../data/09.csv").fillna(0)


In [None]:
trips

In [None]:
trips.start_station_name

In [None]:
trips.end_station_name

In [None]:
trips["start_station_name"]

In [None]:
trips.loc[:, "start_station_name"]

In [None]:
trips.loc[4]

In [None]:
trips.loc[0:4]

In [None]:
trips.loc[[0, 4]]

In [None]:
trips.loc[[0, 4],"start_station_name"]

In [None]:
trips.loc[[0, 4],["start_station_name", "end_station_name"]]

In [None]:
trips.iloc[4]

In [None]:
trips.iloc[4:9]

In [None]:
trips.iloc[5:8, 0]

In [None]:
trips.loc[5, "started_at"]

In [None]:
trips.loc[5, trips.columns[1]]

### Combine variables

In [None]:
trips.start_station_name + trips.end_station_name

In [None]:
trips.assign(combined_id=trips.start_station_id + trips.end_station_id)

### Filter observations

In [None]:
trips.query("duration > 20000")

In [None]:
trips.query("duration < 62")

In [None]:
trips.query("end_station_id >= start_station_id")
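
`query` takes the filter as a string. The same filter can be written as a boolean mask, and Python variables can be referenced inside the query string with `@`. A small sketch on made-up rows (`trips_demo` and `limit` are names chosen for this example):

```python
import pandas as pd

trips_demo = pd.DataFrame(
    {"duration": [510, 223, 25000], "start_station_id": [437, 578, 421]}
)
limit = 20000

with_query = trips_demo.query("duration > @limit")       # @limit reads the variable
with_mask = trips_demo[trips_demo["duration"] > limit]   # equivalent boolean mask
combined = trips_demo.query("duration < 600 and start_station_id > 500")

print(with_query)
print(combined)
```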

In [None]:
# Could optionally add an example of filtering out a subset and storing it in a new dataframe
low_budget = budget[budget["tiltak"] > budget["lån"]]

### Sort observations

In [None]:
trips.sort_values(by="duration")

In [None]:
trips.sort_values(by=["duration", "start_station_id"])

In [None]:
trips.sort_index()

### Exercise

TODO: Something something bysykkel

### Assign new columns
To add a new column to a dataframe, we can use the `assign` method.

Let's assign a new duration column in minutes at the end of our dataframe:


In [1]:
import pandas as pd
df = pd.read_csv("../data/09.csv")

df = df.assign(duration_minutes=df["duration"] / 60)
df.head()

Unnamed: 0,started_at,ended_at,duration,start_station_id,start_station_name,start_station_description,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_description,end_station_latitude,end_station_longitude,duration_minutes
0,2022-09-01 03:04:31.178000+00:00,2022-09-01 03:13:01.298000+00:00,510,437,Sentrum Scene,ved Arbeidersamfunnets plass,59.915468,10.751141,583,Galgeberg,langs St. Halvards gate,59.907076,10.779164,8.5
1,2022-09-01 03:11:09.104000+00:00,2022-09-01 03:14:52.506000+00:00,223,578,Hallings gate,langs Dalsbergstien,59.922777,10.738655,499,Bjerregaards gate,ovenfor Fredrikke Qvams gate,59.925488,10.746058,3.716667
2,2022-09-01 03:11:37.861000+00:00,2022-09-01 03:23:23.939000+00:00,706,421,Alexander Kiellands Plass,langs Maridalsveien,59.928067,10.751203,390,Saga Kino,langs Olav Vs gate,59.91424,10.732771,11.766667
3,2022-09-01 03:13:00.843000+00:00,2022-09-01 03:17:17.639000+00:00,256,735,Oslo Hospital,ved trikkestoppet,59.903213,10.767344,465,Bjørvika,under broen Nylandsveien,59.909006,10.75618,4.266667
4,2022-09-01 03:13:13.330000+00:00,2022-09-01 03:24:15.758000+00:00,662,525,Myraløkka Øst,ved Bentsenbrua,59.937205,10.760581,443,Sjøsiden ved trappen,Oslo S,59.910154,10.751981,11.033333


### Custom lambda functions on assign

Sometimes you need a custom function to do a more complex calculation for each row, involving several columns. You can do this by combining `assign` with `apply` and a lambda function that calls your custom function.

For each trip, let's calculate the straight-line ("as the crow flies") distance in km between the start coordinates and the end coordinates. The function that calculates this is given below:



In [5]:
# Define the custom function that the lambda will call:
from geopy.distance import geodesic


def calculate_distance(lat1, lon1, lat2, lon2):
    """Return the geodesic distance in km between two (lat, lon) locations."""
    origin = (lat1, lon1)
    destination = (lat2, lon2)

    distance_km = geodesic(origin, destination).kilometers
    return distance_km

In [None]:
# Call the function inside an assign statement and check what the new dataframe looks like:

df = df.assign(
    distance=df.apply(
        lambda x: calculate_distance(
            x["start_station_latitude"],
            x["start_station_longitude"],
            x["end_station_latitude"],
            x["end_station_longitude"],
        ),
        axis=1,
    )
)
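
`apply` with `axis=1` calls the function once per row, which can be slow on a dataset of this size. One possible alternative is sketched below. Note that it swaps in a different technique: the haversine formula on a sphere of radius 6371 km, computed with NumPy on whole columns, rather than the ellipsoidal `geodesic` distance from `geopy`. The two should agree closely but not exactly. The helper name `haversine_km` is made up for this sketch.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km (an approximation)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


# Coordinates of one trip from Sentrum Scene to Galgeberg.
demo = pd.DataFrame(
    {
        "start_station_latitude": [59.915468],
        "start_station_longitude": [10.751141],
        "end_station_latitude": [59.907076],
        "end_station_longitude": [10.779164],
    }
)

# The function receives whole columns, so no row-by-row apply is needed.
demo = demo.assign(
    distance_km=lambda d: haversine_km(
        d["start_station_latitude"],
        d["start_station_longitude"],
        d["end_station_latitude"],
        d["end_station_longitude"],
    )
)
print(demo["distance_km"])
```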

### Exercise

Let's do a simple exercise with the function below.

Use what you have learned above to apply the `double_input` function and assign a new column called **double_duration** that contains twice the value of the original **duration** column in the dataframe.

In [2]:
def double_input(input_var):
    return input_var*2

df = df.assign(double_duration = df.apply(lambda x: double_input(x["duration"]), axis=1))
df.head()

Unnamed: 0,started_at,ended_at,duration,start_station_id,start_station_name,start_station_description,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_description,end_station_latitude,end_station_longitude,duration_minutes,double_duration
0,2022-09-01 03:04:31.178000+00:00,2022-09-01 03:13:01.298000+00:00,510,437,Sentrum Scene,ved Arbeidersamfunnets plass,59.915468,10.751141,583,Galgeberg,langs St. Halvards gate,59.907076,10.779164,8.5,1020
1,2022-09-01 03:11:09.104000+00:00,2022-09-01 03:14:52.506000+00:00,223,578,Hallings gate,langs Dalsbergstien,59.922777,10.738655,499,Bjerregaards gate,ovenfor Fredrikke Qvams gate,59.925488,10.746058,3.716667,446
2,2022-09-01 03:11:37.861000+00:00,2022-09-01 03:23:23.939000+00:00,706,421,Alexander Kiellands Plass,langs Maridalsveien,59.928067,10.751203,390,Saga Kino,langs Olav Vs gate,59.91424,10.732771,11.766667,1412
3,2022-09-01 03:13:00.843000+00:00,2022-09-01 03:17:17.639000+00:00,256,735,Oslo Hospital,ved trikkestoppet,59.903213,10.767344,465,Bjørvika,under broen Nylandsveien,59.909006,10.75618,4.266667,512
4,2022-09-01 03:13:13.330000+00:00,2022-09-01 03:24:15.758000+00:00,662,525,Myraløkka Øst,ved Bentsenbrua,59.937205,10.760581,443,Sjøsiden ved trappen,Oslo S,59.910154,10.751981,11.033333,1324


## Aggregate Data

### Bigger datasets

In [None]:
pd.read_csv("../data/09.csv")

In [None]:
trips = pd.read_csv("../data/09.csv")

In [None]:
trips.info()

### Date columns

In [23]:
trips = pd.read_csv("../data/09.csv", parse_dates=["started_at", "ended_at"])
trips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165877 entries, 0 to 165876
Data columns (total 13 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   started_at                 165877 non-null  object 
 1   ended_at                   165877 non-null  object 
 2   duration                   165877 non-null  int64  
 3   start_station_id           165877 non-null  int64  
 4   start_station_name         165877 non-null  object 
 5   start_station_description  165876 non-null  object 
 6   start_station_latitude     165877 non-null  float64
 7   start_station_longitude    165877 non-null  float64
 8   end_station_id             165877 non-null  int64  
 9   end_station_name           165877 non-null  object 
 10  end_station_description    165876 non-null  object 
 11  end_station_latitude       165877 non-null  float64
 12  end_station_longitude      165877 non-null  float64
dtypes: float64(4), int64(3), obje
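
In the output above, `started_at` and `ended_at` are listed as `object`, which suggests the strings were not converted to datetimes in that run. One option in that situation is an explicit `pd.to_datetime` call. The sketch below uses a single made-up row in the same timestamp format, and the column names `minutes` and `weekday` are chosen for this example:

```python
import pandas as pd

# One row of timestamps in the same format as the started_at/ended_at columns.
stamps = pd.DataFrame(
    {
        "started_at": ["2022-09-01 03:04:31.178000+00:00"],
        "ended_at": ["2022-09-01 03:13:01.298000+00:00"],
    }
)

stamps = stamps.assign(
    started_at=lambda d: pd.to_datetime(d["started_at"], utc=True),
    ended_at=lambda d: pd.to_datetime(d["ended_at"], utc=True),
).assign(
    # Datetime columns support arithmetic and the .dt accessor.
    minutes=lambda d: (d["ended_at"] - d["started_at"]).dt.total_seconds() / 60,
    weekday=lambda d: d["started_at"].dt.day_name(),
)

print(stamps.dtypes)
print(stamps[["minutes", "weekday"]])
```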

### Method piping 

Also known as method chaining.

So far you have learned to do one operation on a dataframe at a time.

When you have several consecutive data processing operations on the same dataframe, you can chain (pipe) the method calls into a single expression.

Let's assume we want to do the following subsequent steps on one dataframe:
* Read the data 
* Filter for specific rows: where **start_station_id = 437** and **end_station_id = 583**
* Assign a column double duration
* Sort the rows from longest duration to shortest duration 

Let's do all that with piping!!!



In [22]:
(
    pd.read_csv("../data/09.csv")
    .query("start_station_id==437 & end_station_id==583")
    # The lambda makes assign use the filtered frame in the chain, not an outside variable
    .assign(double_duration=lambda d: 2 * d.duration)
    .sort_values(by="duration", ascending=False)
)

Unnamed: 0,started_at,ended_at,duration,start_station_id,start_station_name,start_station_description,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_description,end_station_latitude,end_station_longitude,double_duration
49301,2022-09-08 13:45:19.320000+00:00,2022-09-08 13:59:33.731000+00:00,854,437,Sentrum Scene,ved Arbeidersamfunnets plass,59.915468,10.751141,583,Galgeberg,langs St. Halvards gate,59.907076,10.779164,1708
128028,2022-09-22 13:37:11.264000+00:00,2022-09-22 13:47:30.743000+00:00,619,437,Sentrum Scene,ved Arbeidersamfunnets plass,59.915468,10.751141,583,Galgeberg,langs St. Halvards gate,59.907076,10.779164,1238
78485,2022-09-13 21:35:59.929000+00:00,2022-09-13 21:46:05.870000+00:00,605,437,Sentrum Scene,ved Arbeidersamfunnets plass,59.915468,10.751141,583,Galgeberg,langs St. Halvards gate,59.907076,10.779164,1210
0,2022-09-01 03:04:31.178000+00:00,2022-09-01 03:13:01.298000+00:00,510,437,Sentrum Scene,ved Arbeidersamfunnets plass,59.915468,10.751141,583,Galgeberg,langs St. Halvards gate,59.907076,10.779164,1020
97434,2022-09-17 03:49:10.987000+00:00,2022-09-17 03:57:04.231000+00:00,473,437,Sentrum Scene,ved Arbeidersamfunnets plass,59.915468,10.751141,583,Galgeberg,langs St. Halvards gate,59.907076,10.779164,946
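
Inside a chain there is no variable name for the intermediate result, so `assign` is usually given a lambda: pandas calls it with the dataframe as it looks at that step. A small sketch on made-up rows (`rides` is a name chosen for this example):

```python
import pandas as pd

rides = pd.DataFrame(
    {"start_station_id": [437, 437, 578], "duration": [510, 854, 223]}
)

result = (
    rides.query("start_station_id == 437")
    # The lambda receives the filtered frame, not the original `rides`.
    .assign(double_duration=lambda d: 2 * d["duration"])
    .sort_values(by="duration", ascending=False)
)
print(result)
```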


### Group by common values

Grouping is a very important operation and often comes together with other wrangling commands.

PERFECT for piping/chaining!

First, test what `groupby` does to a dataframe:

In [24]:
trips.groupby("start_station_name")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11ec94eb0>

The `.size()` method calculates the size of each group created by `groupby`.

It returns a Series with the number of rows in each group, indexed by the group key.

In [25]:
trips.groupby("start_station_name").size()

start_station_name
7. juni-plassen           959
AHO                      1547
Adamstuen                 316
Aker Brygge              2079
Akerselvallmenningen      661
                         ... 
Vulkan                   1216
Vålerenga                 245
Vår Frelsers gravlund    1442
Økern Portal              144
Økernveien                 86
Length: 263, dtype: int64
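
`size` counts all rows in a group, while `count` only counts non-missing values in a column. The difference matters for columns with gaps, such as the station descriptions. A sketch on made-up rows:

```python
import numpy as np
import pandas as pd

groups = pd.DataFrame(
    {
        "start_station_name": ["AHO", "AHO", "Vulkan"],
        "start_station_description": ["langs Maridalsveien", np.nan, "ved Maridalsveien"],
    }
)

# Number of rows per group, missing values included.
rows_per_group = groups.groupby("start_station_name").size()

# Number of non-missing descriptions per group.
non_null_per_group = groups.groupby("start_station_name")[
    "start_station_description"
].count()

print(rows_per_group)
print(non_null_per_group)
```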

In [26]:
trips.groupby("start_station_name").size().sort_values()

start_station_name
Tordenskiolds gate             44
Gaustad T-bane                 58
Furulund                       72
Kværnerveien                   74
Grenseveien                    84
                             ... 
Olaf Ryes plass              1918
Ringnes Park                 1970
Helga Helgesens plass        1984
Aker Brygge                  2079
Alexander Kiellands Plass    2133
Length: 263, dtype: int64

The `reset_index()` method converts the Series to a dataframe:

In [27]:
trips.groupby("start_station_name").size().reset_index()

Unnamed: 0,start_station_name,0
0,7. juni-plassen,959
1,AHO,1547
2,Adamstuen,316
3,Aker Brygge,2079
4,Akerselvallmenningen,661
...,...,...
258,Vulkan,1216
259,Vålerenga,245
260,Vår Frelsers gravlund,1442
261,Økern Portal,144


In [29]:
(
    trips.groupby("start_station_name")
    .size()
    .reset_index()
    .rename(columns={0: "num_trips"})
)

Unnamed: 0,start_station_name,num_trips
0,7. juni-plassen,959
1,AHO,1547
2,Adamstuen,316
3,Aker Brygge,2079
4,Akerselvallmenningen,661
...,...,...
258,Vulkan,1216
259,Vålerenga,245
260,Vår Frelsers gravlund,1442
261,Økern Portal,144


In [31]:
(
    trips.groupby("start_station_name")
    .size()
    .reset_index()
    .rename(columns={0: "num_trips"})
    .sort_values(by="num_trips")
)

Unnamed: 0,start_station_name,num_trips
233,Tordenskiolds gate,44
78,Gaustad T-bane,58
74,Furulund,72
127,Kværnerveien,74
81,Grenseveien,84
...,...,...
160,Olaf Ryes plass,1918
175,Ringnes Park,1970
96,Helga Helgesens plass,1984
3,Aker Brygge,2079


In [32]:
(
    trips.groupby("end_station_name")
    .size()
    .reset_index()
    .rename(columns={0: "num_trips"})
    .sort_values(by="num_trips")
)

Unnamed: 0,end_station_name,num_trips
78,Gaustad T-bane,22
176,Rolf Wickstrøms vei,44
127,Kværnerveien,60
33,Borgenveien,60
92,HasleLinje,61
...,...,...
192,Sjøsiden ved trappen,2264
234,Torggata,2266
119,Kirkeristen,2318
96,Helga Helgesens plass,2414


In [33]:
num_trips = (
    trips.groupby("start_station_name")
    .size()
    .reset_index()
    .rename(columns={0: "num_trips"})
    .sort_values(by="num_trips")
)

### Aggregations: sum, mean, median, first, count, ...

Why does the code below not work?

In [34]:
trips.groupby("start_station_name").median()

TypeError: agg function failed [how->median,dtype->object]
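
The error message mentions `dtype->object`: pandas tried to take the median of text columns such as station names. Two ways around this are sketched below on made-up rows: select the numeric column first, or pass `numeric_only=True`.

```python
import pandas as pd

mixed = pd.DataFrame(
    {
        "start_station_name": ["AHO", "AHO", "Vulkan"],
        "end_station_name": ["Vulkan", "AHO", "AHO"],
        "duration": [400, 500, 300],
    }
)

# Option 1: pick the numeric column before aggregating.
by_column = mixed.groupby("start_station_name")["duration"].median()

# Option 2: ask pandas to skip non-numeric columns.
numeric_only = mixed.groupby("start_station_name").median(numeric_only=True)

print(by_column)
print(numeric_only)
```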

In [35]:
trips.groupby("start_station_name").agg(median_duration=("duration", "median"))

Unnamed: 0_level_0,median_duration
start_station_name,Unnamed: 1_level_1
7. juni-plassen,423.0
AHO,482.0
Adamstuen,521.5
Aker Brygge,625.0
Akerselvallmenningen,738.0
...,...
Vulkan,472.0
Vålerenga,503.0
Vår Frelsers gravlund,420.0
Økern Portal,858.0


In [36]:
# Sidenote, we could do the size example as follows
trips.groupby("start_station_name").agg(
    num_trips=("start_station_name", "size")
).reset_index().sort_values(by="num_trips")

Unnamed: 0,start_station_name,num_trips
233,Tordenskiolds gate,44
78,Gaustad T-bane,58
74,Furulund,72
127,Kværnerveien,74
81,Grenseveien,84
...,...,...
160,Olaf Ryes plass,1918
175,Ringnes Park,1970
96,Helga Helgesens plass,1984
3,Aker Brygge,2079


In [37]:
trips.groupby("start_station_name").agg(
    median_duration=("duration", "median"),
    description=("start_station_description", "first"),
)

Unnamed: 0_level_0,median_duration,description
start_station_name,Unnamed: 1_level_1,Unnamed: 2_level_1
7. juni-plassen,423.0,langs Henrik Ibsens gate
AHO,482.0,langs Maridalsveien
Adamstuen,521.5,ved trikkestoppen
Aker Brygge,625.0,ved trikkestopp
Akerselvallmenningen,738.0,øst for Operaen
...,...,...
Vulkan,472.0,ved Maridalsveien
Vålerenga,503.0,langs Ingeborgs gate
Vår Frelsers gravlund,420.0,langs Ullevålsveien
Økern Portal,858.0,ved Dag Hammarskjölds vei


In [38]:
def most_common(column):
    return column.mode().iloc[0]


trips.groupby("start_station_name").agg(
    median_duration=("duration", "median"),
    description=("start_station_description", "first"),
    common_end_station=("end_station_name", most_common),
)

Unnamed: 0_level_0,median_duration,description,common_end_station
start_station_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
7. juni-plassen,423.0,langs Henrik Ibsens gate,OsloMet
AHO,482.0,langs Maridalsveien,Helga Helgesens plass
Adamstuen,521.5,ved trikkestoppen,Vår Frelsers gravlund
Aker Brygge,625.0,ved trikkestopp,Saga Kino
Akerselvallmenningen,738.0,øst for Operaen,Aker Brygge
...,...,...,...
Vulkan,472.0,ved Maridalsveien,Vulkan
Vålerenga,503.0,langs Ingeborgs gate,Helga Helgesens plass
Vår Frelsers gravlund,420.0,langs Ullevålsveien,Akersgata
Økern Portal,858.0,ved Dag Hammarskjölds vei,Munkegata


In [39]:
trips.groupby(["start_station_name", "end_station_name"]).agg(
    median_duration=("duration", "median")
)

Unnamed: 0_level_0,Unnamed: 1_level_0,median_duration
start_station_name,end_station_name,Unnamed: 2_level_1
7. juni-plassen,7. juni-plassen,101.0
7. juni-plassen,AHO,728.0
7. juni-plassen,Adamstuen,711.0
7. juni-plassen,Aker Brygge,281.0
7. juni-plassen,Akerselvallmenningen,587.0
...,...,...
Økernveien,Tøyen skole,145.5
Økernveien,Tøyenbekken,308.0
Økernveien,Tøyenparken,72.0
Økernveien,Vulkan,569.0


In [None]:
trips.groupby(["start_station_name", "end_station_name"]).agg(
    median_duration=("duration", "median"),
    start_station_description=("start_station_description", "first"),
    end_station_description=("end_station_description", "first"),
)

In [None]:
trips.groupby(["start_station_name", "end_station_name"]).agg(
    median_duration=("duration", "median"),
    start_station_description=("start_station_description", "first"),
    end_station_description=("end_station_description", "first"),
).reset_index()

### Looping through large datasets

In [None]:
for index, row in trips.iterrows():
    trips.at[index, 'end_station_coordinates'] = str(row['end_station_latitude']) + ', ' + str(row['end_station_longitude'])

In [None]:
trips['end_station_coordinates'] = trips['end_station_latitude'].astype(str) + ', ' + trips['end_station_longitude'].astype(str)

In [None]:
for i in range(len(trips)):
    trips.at[i, 'trip_duration_minutes'] = (trips.at[i, 'ended_at'] - trips.at[i, 'started_at']).total_seconds() / 60

In [None]:
trips['trip_duration_minutes'] = (trips['ended_at'] - trips['started_at']).dt.total_seconds() / 60

In [None]:
for i in range(len(trips)):
    trips.at[i, 'long_trip'] = trips.at[i, 'trip_duration_minutes'] > 30

In [None]:
trips['long_trip'] = [trip.trip_duration_minutes > 30 for trip in trips.itertuples()]
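
The comparison in the last pair of cells can also be done without any explicit iteration, by comparing the whole column at once. A sketch on made-up durations, checking that both approaches give the same answer:

```python
import pandas as pd

small = pd.DataFrame({"trip_duration_minutes": [8.5, 45.0, 31.2]})

# Row by row, as in the list comprehension above.
looped = [row.trip_duration_minutes > 30 for row in small.itertuples()]

# Vectorized: one comparison on the whole column.
small["long_trip"] = small["trip_duration_minutes"] > 30

print(small)
print(looped)
```

On a frame with well over a hundred thousand rows, the vectorized form is typically much faster than either loop, though the exact speed-up depends on the machine and the pandas version.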

### Exercise

You have a dataset for the month of September (09.csv), containing information about bike trips. Your task is to aggregate the data and find the total number of long trips (where the duration is greater than 30 minutes) for each end station. 

In [None]:
trips_sep = pd.read_csv("../data/09.csv", parse_dates=["started_at", "ended_at"])

trips_sep['long_trip'] = trips_sep['duration'] > 30 * 60  # duration is in seconds
long_trip_counts = trips_sep.groupby('end_station_name')['long_trip'].sum().reset_index()
long_trip_counts.columns = ['end_station_name', 'total_long_trips']

print(long_trip_counts)


## Combine Data Tables

We have two files with the same kinds of data: `08.csv` with data for August and `09.csv` with data for September. How can we combine them into one DataFrame?

In [None]:
trips_aug = pd.read_csv("../data/08.csv", parse_dates=["started_at", "ended_at"])
trips_sep = pd.read_csv("../data/09.csv", parse_dates=["started_at", "ended_at"])

### Append tables with similar data

In [None]:
pd.concat([trips_aug, trips_sep])

In [None]:
pd.concat([trips_aug, trips_sep]).reset_index()

In [None]:
pd.concat([trips_aug, trips_sep]).reset_index(drop=True)
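
Two variations on `pd.concat` that can be handy here are sketched below on made-up frames: `ignore_index=True` renumbers the rows in one step, and adding a column before concatenating keeps track of which table each row came from. The column name `month` is chosen for this example.

```python
import pandas as pd

aug = pd.DataFrame({"duration": [510, 223]})
sep = pd.DataFrame({"duration": [706]})

# ignore_index=True renumbers rows, like .reset_index(drop=True).
renumbered = pd.concat([aug, sep], ignore_index=True)

# A new column records which table each row came from.
labelled = pd.concat(
    [aug.assign(month="08"), sep.assign(month="09")], ignore_index=True
)

print(renumbered)
print(labelled)
```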

In [None]:
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)

In [None]:
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)
    trips = pd.read_csv(filnavn, parse_dates=["started_at", "ended_at"])

In [None]:
trips.started_at

In [None]:
months = []
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)
    months.append(pd.read_csv(filnavn, parse_dates=["started_at", "ended_at"]))

In [None]:
months

In [None]:
months = []
for filnavn in ["../data/08.csv", "../data/09.csv"]:
    print(filnavn)
    months.append(pd.read_csv(filnavn, parse_dates=["started_at", "ended_at"]))
trips = pd.concat(months).reset_index(drop=True)

In [None]:
trips

In [None]:
import pathlib

pathlib.Path.cwd().parent / "data"

In [None]:
(pathlib.Path.cwd().parent / "data").glob("*.csv")

In [None]:
list((pathlib.Path.cwd().parent / "data").glob("*.csv"))
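
The paths returned by `glob` can be fed straight into `pd.read_csv`. The sketch below assumes a temporary folder with made-up files so that it runs anywhere. It uses the pattern `[0-9][0-9].csv` rather than `*.csv`, because a data folder may also contain CSV files that are not monthly trip tables.

```python
import pathlib
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    data_dir = pathlib.Path(tmp)
    # Create two hypothetical monthly files plus one unrelated CSV.
    (data_dir / "08.csv").write_text("duration\n510\n223\n")
    (data_dir / "09.csv").write_text("duration\n706\n")
    (data_dir / "kap1.csv").write_text("other\n1\n")

    # Match two-digit file names only; sorted() gives a predictable order.
    paths = sorted(data_dir.glob("[0-9][0-9].csv"))
    all_trips = pd.concat(
        [pd.read_csv(path) for path in paths], ignore_index=True
    )

print([path.name for path in paths])
print(all_trips)
```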

### Exercise

You have two datasets, one for August (08.csv) and another for September (09.csv). Each dataset contains information about bike trips. Your task is to combine these two datasets and find out the total number of bike trips for each station in these two months.

In [None]:
# Solution

# Read the data
trips_aug = pd.read_csv("../data/08.csv", parse_dates=["started_at", "ended_at"])
trips_sep = pd.read_csv("../data/09.csv", parse_dates=["started_at", "ended_at"])

# Combine the datasets
trips = pd.concat([trips_aug, trips_sep]).reset_index(drop=True)

# Count the number of trips for each station
station_counts = trips['start_station_name'].value_counts().reset_index()
station_counts.columns = ['start_station_name', 'trip_count']

print(station_counts)

### Join tables with common variables

In [None]:
num_trips = (
    trips.groupby("start_station_name")
    .size()
    .reset_index(name="num_trips")
    .sort_values(by="num_trips")
)
num_trips

In [None]:
trip_lengths = (
    trips.groupby("start_station_name")
    .agg(median_duration=("duration", "median"))
    .reset_index()
    .sort_values(by="median_duration")
)
trip_lengths

In [None]:
pd.merge(num_trips, trip_lengths)

In [None]:
num_trips_from = (
    trips.groupby("start_station_name")
    .agg(num_trips=("start_station_name", "size"))
    .sort_values(by="num_trips")
    .reset_index()
)
num_trips_from

In [None]:
num_trips_to = (
    trips.groupby("end_station_name")
    .agg(num_trips=("end_station_name", "size"))
    .sort_values(by="num_trips")
    .reset_index()
)
num_trips_to

In [None]:
pd.merge(num_trips_from, num_trips_to)

In [None]:
pd.merge(
    num_trips_from,
    num_trips_to,
    left_on="start_station_name",
    right_on="end_station_name",
)

In [None]:
popular_from = num_trips_from.nlargest(10, "num_trips")
popular_to = num_trips_to.nlargest(10, "num_trips")

In [None]:
pd.merge(
    popular_from, popular_to, left_on="start_station_name", right_on="end_station_name"
)

In [None]:
pd.merge(
    popular_from,
    popular_to,
    how="inner",
    left_on="start_station_name",
    right_on="end_station_name",
)

In [None]:
pd.merge(
    popular_from,
    popular_to,
    how="left",
    left_on="start_station_name",
    right_on="end_station_name",
)

In [None]:
pd.merge(
    popular_from,
    popular_to,
    how="right",
    left_on="start_station_name",
    right_on="end_station_name",
)

In [None]:
pd.merge(
    popular_from,
    popular_to,
    how="outer",
    left_on="start_station_name",
    right_on="end_station_name",
)
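
To see which rows each `how` option keeps, `pd.merge` can add a `_merge` column with `indicator=True`. A sketch on made-up station counts (the frames `left` and `right` are placeholders for `popular_from` and `popular_to`):

```python
import pandas as pd

left = pd.DataFrame({"start_station_name": ["AHO", "Vulkan"], "num_trips": [3, 5]})
right = pd.DataFrame({"end_station_name": ["Vulkan", "Torggata"], "num_trips": [4, 6]})

joined = pd.merge(
    left,
    right,
    how="outer",
    left_on="start_station_name",
    right_on="end_station_name",
    suffixes=("_from", "_to"),  # tell the two num_trips columns apart
    indicator=True,  # adds a _merge column: left_only, right_only or both
)
print(joined)
```

Rows marked `both` are the ones an inner join would keep. A left join keeps `both` and `left_only`, and a right join keeps `both` and `right_only`.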

### Exercise 2

Merge the `trips` DataFrame with the `stations` DataFrame to add the address and number of bike docks for each start station to the `trips` DataFrame. We want to merge on `start_station_id` in the `trips` DataFrame and `station_id` in the `stations` DataFrame.

In [None]:
stations = pd.DataFrame({
    'station_id': [564, 421, 621, 447, 430, 558, 424, 428],
    'station_name': ['Oscars gate', 'Alexander Kiellands Plass', 'Torshovdalen øst', 'Kværnerbyen', 'Spikersuppa Vest', 'Dokkveien', 'Birkelunden', 'Olav Kyrres plass'],
    'address': ['Oscars gate 1', 'Alexander Kiellands Plass 2', 'Torshovdalen øst 3', 'Kværnerbyen 4', 'Spikersuppa Vest 5', 'Dokkveien 6', 'Birkelunden 7', 'Olav Kyrres plass 8'],
    'num_docks': [10, 15, 12, 14, 16, 14, 15, 12]
})

In [None]:
# Solution

# Rename the columns in stations to indicate they are about the start station
stations.columns = ['start_station_id', 'start_station_name', 'start_station_address', 'start_station_num_docks']

# Merge the DataFrames
merged_trips = pd.merge(trips, stations, on=['start_station_id', 'start_station_name'], how='left')

print(merged_trips.head())

### Exercise 1

What are the 10 most popular destinations from Alexander Kiellands Plass? Sort them from most rides to fewest.

In [None]:
# Solution

# Filter the DataFrame
filtered_trips = trips[trips['start_station_name'] == 'Alexander Kiellands Plass']

# Group by end_station_name and count, sort in descending order, take top 10
top_10_end_stations = filtered_trips.groupby('end_station_name').size().sort_values(ascending=False).head(10).reset_index(name='count')

print(top_10_end_stations)

## Self study - Sharing Insights

### Mess up data for presentation

In [None]:
from_to = (
    trips.groupby(["start_station_name", "end_station_name"])
    .agg(num_trips=("start_station_name", "size"))
    .reset_index()
    .sort_values(by="num_trips")
)

In [None]:
from_to.query(
    "start_station_name.isin(@popular_from.start_station_name) and end_station_name.isin(@popular_to.end_station_name)"
).pivot_table(
    index="start_station_name", columns="end_station_name", values="num_trips"
)
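
A pivot turns the tidy from/to table into a matrix that is easier to read in a presentation. The sketch below shows the idea on made-up pairs. `fill_value=0` is an optional extra that shows 0 instead of `NaN` for pairs with no trips.

```python
import pandas as pd

pairs = pd.DataFrame(
    {
        "start_station_name": ["AHO", "AHO", "Vulkan"],
        "end_station_name": ["AHO", "Vulkan", "AHO"],
        "num_trips": [2, 7, 4],
    }
)

matrix = pairs.pivot_table(
    index="start_station_name",
    columns="end_station_name",
    values="num_trips",
    fill_value=0,  # pairs with no trips show 0 instead of NaN
)
print(matrix)
```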

### Save to CSV

In [None]:
from_to.to_csv("from_to.csv", index=False)

### More visualizations

In [None]:
from_to

In [None]:
num_trips_to = (
    trips.groupby("end_station_name")
    .agg(
        num_trips=("end_station_name", "size"),
        lat=("end_station_latitude", "first"),
        lon=("end_station_longitude", "first"),
    )
    .sort_values(by="num_trips")
    .reset_index()
)

In [None]:
import numpy as np

(
    pd.merge(
        num_trips_from,
        num_trips_to,
        left_on="start_station_name",
        right_on="end_station_name",
        suffixes=("_from", "_to"),
    )
    .assign(from_over_to=lambda df: np.log(df.num_trips_from / df.num_trips_to))
    .plot.scatter(x="lon", y="lat", c="from_over_to")
)