# Class 2: Coding Tour


### 1. Pandas
    
    1.1 Reading in a pandas dataframe
    1.2 Inspecting a dataframe
    1.3 Modifying columns
    1.4 Indexing and slicing
    1.5 Summary statistics
    1.6 Apply operations
    1.7 Groupby


### 2. NumPy

    2.1 Basic operations
    2.2 Joining arrays
    2.3 Multidimensional arrays
    2.4 Missing values


### 3. Classes


### 4. Reading and Writing Files
    4.0 Filepaths
    4.1 Pandas
    4.2 NumPy
    4.3 Pickle
    4.4 Parquet (Optional)
    4.5 Base Python (Optional)
    


# 1. Pandas

Pandas is *the* library in Python when working with tabular data.

But... it sucks...

Compared to tidyverse or data.table in R (or the syntax in Stata), pandas is so messy.

But... since we are used to thinking about data in terms of rows and columns, it is the most convenient way to store data in Python.

In [None]:
import pandas as pd

### 1.1 Reading in a pandas dataframe

In [None]:
# Specify filepath. REMEMBER TO CHANGE THE PATH TO YOUR OWN!!!
# Note that you can download the data from the course GitHub repo in the data/ folder.
# It is very important to specify the correct filepath. 
# Whether the slashes should be / or \ may depend on your operating system.
fpath = 'C:/Users/au535365/Dropbox/teaching/css_fall2023/data/Corp_Folketing.csv'

In [None]:
# Read the file into a pandas dataframe - we explain this later in the tutorial :-)
df = pd.read_csv(fpath)

## Detour - Absolute and Relative Filepaths

Directories, folders, and filepaths are boring but key concepts when working with data. We simply need to know where our data, code, and so on are located.

In [None]:
import os

In [None]:
# Get current working directory with os.getcwd()
os.getcwd()

In [None]:
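# Change working directory with os.chdir()
# If the folder does not exist (note the hyphen in the name here), os.chdir() raises a FileNotFoundError
os.chdir('C:/Users/au535365/Dropbox/teaching/css-fall2023')

In [None]:
# The same call with the folder name written with an underscore
os.chdir('C:/Users/au535365/Dropbox/teaching/css_fall2023')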

In [None]:
# List the files and folders in the working directory - relative paths start from here
os.listdir()

In [None]:
# Absolute path
fpath = 'C:/Users/au535365/Dropbox/teaching/css_fall2023/data/Corp_Folketing.csv'
df = pd.read_csv(fpath)

In [None]:
# Relative path
df = pd.read_csv('data/Corp_Folketing.csv')

### 1.2 Inspecting a dataframe

In [None]:
# Overview of what's in the dataframe:
# Number of rows
# Number of columns
# Type of columns

df.info()

In [None]:
# Get column names
df.columns

In [None]:
# See top n rows (see bottom n rows with .tail)
df.head(n=10)

In [None]:
# See unique values in column 
df.date.unique().tolist()

In [None]:
# Columns can also be accessed like this. I prefer this way.
df['date']

### 1.3 Modifying columns

    - Renaming
    - Type casting

In [None]:
# Rename columns
df.rename(columns={'party.facts.id': 'party_id'}, inplace=True)

In [None]:
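# Type casting of a column is a bit complicated if the column has NaNs: NaN is a float, so a plain int column cannot hold it.
# I often use a work-around solution (pandas' nullable dtype is an alternative: df['party_id'].astype('Int64')).
df['party_id'] = pd.Series([int(x) if pd.notnull(x) else None for x in df['party_id'].tolist()], index=df.index, dtype=object)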

In [None]:
# Type casting is straightforward without NaNs
df['speechnumber'] = df['speechnumber'].astype(float)
df['speechnumber']

In [None]:
# Convert back to int
df['speechnumber'] = df['speechnumber'].astype(int)
df['speechnumber']

### 1.4 Indexing and slicing

The most fundamental action in data analysis is the ability to “select” elements within your dataset.

Pandas indexing and slicing are a little bit tedious, I think.

Two ways:
* `.loc`  (label indexing)
* `.iloc` (integer indexing)

Usually, I prefer `.loc` since it is more explicit.

Often, we want to select specific rows, columns or even some specific cells.
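
A small sketch of the difference between the two, assuming a made-up dataframe (the names `toy` and `toy_s` and their values are hypothetical and not part of the course data). After filtering, the row labels no longer start at 0, and that is where `.loc` and `.iloc` give different answers:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'party': ['V', 'S', 'DF', 'S'],
    'speaker': ['A', 'B', 'C', 'D'],
})

# Filtering keeps the original row labels (here 1 and 3)
toy_s = toy.loc[toy['party'] == 'S']

# .iloc selects by position: the first row of the subset
first_by_position = toy_s.iloc[0]

# .loc selects by label: label 1 exists in the subset
first_by_label = toy_s.loc[1]

# Asking for a label that is not in the index raises a KeyError
try:
    toy_s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position['speaker'], first_by_label['speaker'], label_zero_found)
```

Both selections return the same row here, but only because label 1 happens to sit at position 0 of the subset.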

In [None]:
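# Subset a dataframe to include speeches given by legislators from S
# .copy() gives us an independent dataframe, which avoids pandas warnings when we modify df_s later
df_s = df.loc[df['party'] == 'S'].copy()
df_s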

In [None]:
# Assign a value to a new column for the selected rows only - all other rows get NaN
df.loc[df['party'] == 'S', 'test'] = 'hej'

In [None]:
# Extract first row with .loc
# What happens here?
df_s.loc[0]

In [None]:
# Extract first row with .iloc
# What happens here?
df_s.iloc[0]

In [None]:
# To enable using df_s.loc[0, :], we need to reset the index. NB: USE WITH CARE - drop=True throws away the old row labels!
df_s.reset_index(drop=True, inplace=True)
df_s.loc[0]

In [None]:
# Get speaker value for the row with label 10 (the 11th row, since we count from 0) using .loc
df_s.loc[10, 'speaker']

In [None]:
# This raises an error since .iloc only accepts integer positions, not column names
df_s.iloc[10, 'speaker']

In [None]:
# Slice a dataframe from label 0 up to and including label 20, taking every second row.
df_s.loc[0:20:2, ['party', 'speaker']]

In [None]:
# Keep only rows with chair == False (i.e. remove rows with chair == True)
df = df.loc[df['chair'] == False]

In [None]:
# Keep only speeches given by legislators from certain parties
red_parties = ['S', 'EL', 'SF', 'RV', 'ALT']
blue_parties = ['V', 'DF', 'KF', 'LA']

df = df.loc[df['party'].isin(red_parties + blue_parties)]
df

In [None]:
# Reset indices
df.reset_index(drop=True, inplace=True)

### 1.5 Summary statistics

In [None]:
# Count values by party
df['party'].value_counts()

In [None]:
# Count values by chair - note the different way of calling a column
df.chair.value_counts()

In [None]:
# Summary stats - mean, std, quartiles etc. for numerical columns
df['terms'].describe()

### 1.6 Apply

The `.apply()` method belongs to pandas Series and dataframes and is very convenient for applying a function on an element-wise basis.

If we, for instance, want to convert all values in a column to 1 if X and to 0 if Y, this can be done using `.apply()`.
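
A hedged sketch of the 1/0 recoding described above, assuming a made-up column (the dataframe `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'party': ['S', 'V', 'S', 'EL']})

# apply() calls the function once per element of the column
toy['is_S'] = toy['party'].apply(lambda x: 1 if x == 'S' else 0)

# A vectorized comparison gives the same result without apply()
toy['is_S_vectorized'] = (toy['party'] == 'S').astype(int)

print(toy)
```

For simple comparisons like this one the vectorized version is usually faster, while `.apply()` is the more flexible option when the function gets complicated.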




In [None]:
# We want to extract the year from the date variable.
df['date'].apply(lambda x: x.split('-')[0])

In [None]:
# Can be directly assigned as a new column: 
df['year'] = df['date'].apply(lambda x: x.split('-')[0])

In [None]:
# Conditional value assignment 
df['party_S'] = df['party'].apply(lambda x: 'Yes' if x == 'S' else 'No')

In [None]:
# To see why it is necessary, consider this example. Here we are computing the length of the text column.
# Do you think the `len()` function operates element-wise? Well...
df['chars'] = len(df['text'])
print(df['chars'])

In [None]:
# We need the apply method! It calls len() on each text separately.
df['chars'] = df['text'].apply(len)
print(df['chars'])

### 1.7 Groupby

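The `.groupby()` operation is very convenient when we want to operate within groups: it splits the dataframe by the values of one or more columns, applies a function to each group, and combines the results.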
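
A minimal sketch of the idea, assuming a made-up dataframe (the name `toy` and the numbers are hypothetical and not taken from the course data):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'party': ['S', 'S', 'V', 'V', 'V'],
    'terms': [100, 200, 50, 150, 100],
})

# One row per group: the mean number of terms for each party
party_means = toy.groupby('party')['terms'].mean()

# transform() returns one value per original row, so it can be stored as a column
toy['party_mean'] = toy.groupby('party')['terms'].transform('mean')

print(party_means)
print(toy)
```

The first call shrinks the data to one value per party, while `transform()` keeps the original number of rows.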


In [None]:
# Grouping dataframes by one or more columns. 
terms_df = df.groupby(['party'])['terms'].describe()

In [None]:
terms_df

In [None]:
# Simple clean-up
terms_df['party'] = terms_df.index.tolist()
terms_df.reset_index(drop=True, inplace=True)

# Now we can see which parties use the most words on average
terms_df

In [None]:
grouped_df = df.groupby(['party', 'year'])['terms'].describe().reset_index()
grouped_df

In [None]:
# Extract only the mean
grouped_df = df.groupby(['party', 'year'])['terms'].mean().reset_index()
grouped_df

In [None]:
# Add the party-year mean as a new column: transform() returns one value for each original row
df['party_mean_terms'] = df.groupby(['party', 'year'])['terms'].transform('mean')
df

In [None]:
# We need to make sure that Python interprets the data correctly: 'year' was split from a string, so it is stored as text.
# Uncomment to convert it to integers:
# grouped_df['year'] = grouped_df['year'].astype(int)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()

In [None]:
# Plot the mean number of words for each party over time
fig, ax = plt.subplots()
for k, v in grouped_df.groupby('party'):
    v.plot(x='year', y='terms', label=k, ax=ax, marker='o', markersize=3)
plt.xlabel('Year', fontsize=10)
plt.ylabel('Average number of words', fontsize=10)
plt.xticks(size=10)
plt.yticks(size=10)
plt.legend(frameon=False, fontsize=10)
plt.show()

# 2. NumPy 

Almost every data library in Python has ties to NumPy, short for **Numerical Python**.

Often, you will have a pandas dataframe that you want to pass on to NumPy.

NumPy is very efficient: it works on whole arrays at once and exploits that computers are very fast at linear algebra.
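
A small sketch of passing a dataframe on to NumPy, assuming a made-up dataframe (the name `toy`, its columns, and the numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'terms': [100, 200, 300], 'chars': [500, 900, 1600]})

# Convert the selected columns to a NumPy array
arr = toy[['terms', 'chars']].to_numpy()

# Operations apply to the whole array at once, no Python loop needed
chars_per_term = arr[:, 1] / arr[:, 0]

print(arr.shape)
print(chars_per_term)
```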

In [None]:
import numpy as np

### 2.1 Basic operations

In [None]:
# Create 1d array
myarr0 = np.arange(10)
myarr0

In [None]:
# Identical in values to:
mylist0 = list(range(10))

In [None]:
# What's the shape of the array? A 1d array has shape (n,); 2d arrays use (row, col) syntax
myarr0.shape

In [None]:
# Indexing works just like for lists - get first two elements like this:
myarr0[:2]

In [None]:
# Filtering a 1d array: keep the values below 5
myarr0[myarr0 < 5]

In [None]:
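# Filtering numbers by multiple conditions: (between 5 and 10) OR (below 2)
# & binds tighter than |, but the extra parentheses make the intent explicit
myarr0[((myarr0 >= 5) & (myarr0 <= 10)) | (myarr0 < 2)]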

In [None]:
# Replacing values in a 1d array (we work on a copy to keep myarr0 intact)
myarr1 = myarr0.copy()
myarr1[myarr1 < 5] = 0
myarr1

In [None]:
# Reshaping an array
myarr1 = myarr0.reshape(2,5)

# Now it is two dimensional!
myarr1.shape

In [None]:
# Setting a dimension to -1 means that NumPy will figure out its size
myarr2 = myarr0.reshape(2, -1)
myarr2.shape

In [None]:
# We can automatically increase the dimension of our array by using np.newaxis
print(f"Dimensions of original array: {myarr0.shape}")
print(f"Dimensions of modified array: {myarr0[:, np.newaxis].shape}")

In [None]:
# Extract indices where two arrays match

a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

np.where(a == b)[0]

In [None]:
# Find maximum value element-wise
a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])

# np.maximum compares the two arrays position by position
np.maximum(a, b)

In [None]:
# Find index of max and min value
np.argmax(a), np.argmin(a)

In [None]:
# Stats
np.mean(a), np.std(a), np.max(a), np.min(a)

In [None]:
# Find unique values
a = np.array([0,0,1,1])
np.unique(a)

In [None]:
# Many numpy operations also work on lists!
np.unique([0,0,1, 1])

In [None]:
# Sort an array
a = np.array([1,2,9,5,4])
np.sort(a)

In [None]:
# Get indices which will sort an array
np.argsort(a)

In [None]:
# Sort using argsort
a[np.argsort(a)]

In [None]:
# Check if array has certain elements
np.isin(a, [0, 1, 2])

In [None]:
# Save an array like this (NumPy adds the .npy extension to the filename):
np.save('some_array', a)

### 2.2 Joining arrays

In [None]:
# Vertical stacking of arrays
a = np.arange(10)
b = np.repeat(1, 10)

m = np.vstack([a,b])
print(f"Array: \n{m}\n\n")
print(f"Shape of array: {m.shape}")

In [None]:
# Stacking two 2d arrays of shape (2, 5) vertically gives an array of shape (4, 5)
a, b = a.reshape(2,5), b.reshape(2,5)

m = np.vstack([a,b])
print(f"Array: \n{m}\n\n")
print(f"Shape of array: {m.shape}")

In [None]:
# Horizontal stacking - basically concatenation
a = np.arange(10)
b = np.repeat(1, 10)

m = np.hstack([a,b])
print(f"Array: \n{m}\n\n")
print(f"Shape of array: {m.shape}")

In [None]:
# Concatenation: row-wise
print(np.concatenate([a, b], axis=0))

In [None]:
# Concatenation: column-wise
# Not possible due to 1d arrays! They have no axis 1, so this raises an error.
np.concatenate([a, b], axis=1)

In [None]:
a, b = a.reshape(2,5), b.reshape(2,5)
np.concatenate([a, b], axis=1)

### 2.3 N-dimensional arrays (matrices)

The real power of NumPy comes from the ability to handle N dimensions and not just a single one. We have already seen that arrays can have more than one dimension.
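
A short sketch of what the extra dimension buys us, assuming a made-up 2x3 array (the name `m` is hypothetical). The `axis` argument decides which dimension an operation collapses:

```python
import numpy as np

# Hypothetical toy array: [[0, 1, 2], [3, 4, 5]]
m = np.arange(6).reshape(2, 3)

col_sums = m.sum(axis=0)   # collapse the rows: one sum per column
row_sums = m.sum(axis=1)   # collapse the columns: one sum per row

print(m.ndim, m.shape)
print(col_sums)
print(row_sums)
```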

In [None]:
# Create a matrix of dimension 3x3 of booleans. 

# One solution:
m0 = np.ones((3,3), dtype=bool)

# Another solution
m1 = np.ones((9), dtype=bool).reshape(3,3)

m0==m1

In [None]:
# Generate 3x4 matrix with random numbers
array = np.random.randn(3, 4)
array

In [None]:
# Subsetting row 1, column 1 - What's the expected output?
array[1,1]

In [None]:
# First row (same as array[0, :])
array[0,]

In [None]:
# Indexing rows from 1 to the last row, and cols from 1 to the last col
array[1:,1:]

In [None]:
# Reverse rows
array[::-1 , :]

In [None]:
# Reverse cols
array[: , ::-1]

In [None]:
multiarray = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
multiarray.shape            # 2 arrays, each with dimension 2x3

In [None]:
multiarray = np.array([[[1, 2], [4, 5]], [[7, 8], [10, 11]], [[9, 10], [10, 10]]])
multiarray.shape            # 3 arrays, each with dimension 2x2

### 2.4 Missing values

In [None]:
# Find missing values using np.isnan
missing_array = np.array([1,2,2, np.nan, np.nan, 3])
missing_bool = np.isnan(missing_array)
missing_bool

In [None]:
# Combine with np.where to get indices
np.where(missing_bool)[0]

In [None]:
# np.isnan also works on N-dimensional arrays
missing_array0 = np.array([1,2,2, np.nan, np.nan, 3])
missing_array1 = np.array([1,np.nan,2, 3, np.nan, 3])
np.isnan(np.vstack([missing_array0, missing_array1]))

# 3. Classes 

Classes are a fundamental concept in object-oriented programming, and they allow you to create custom data types with attributes (variables) and methods (functions) associated with them. 

You will rarely need to make classes yourself, but everything you use is an object created from a class!
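
To illustrate, here is a small sketch that inspects objects we have already been using (the names `arr` and `toy` are made up for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy objects, only for illustration
arr = np.arange(3)
toy = pd.DataFrame({'x': [1, 2, 3]})

# type() shows which class an object was created from
print(type(arr))
print(type(toy))
print(type('hej'))

# Attributes hold data, methods are functions attached to the object
print(arr.shape)       # attribute
print(toy['x'].sum())  # method
```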

In [None]:
# Define a class named "Student" - we use the 'class' keyword
class Student:
    
    # Constructor method (__init__) initializes the object's attributes
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        self.grade = grade

    # Method to display student information
    def display_info(self):
        # logage is a local variable: it is not stored on self and disappears when the method returns
        logage = np.log(self.age)
        print(f"Name: {self.name}, Age: {self.age}, Grade: {self.grade}")

In [None]:
# Create instances (objects) of the Student class
student1 = Student("Alice", 18, "A")
student2 = Student("Bob", 17, "B")

# Accessing attributes and calling methods of the objects
print("Student 1:")
student1.display_info()  

print("\nStudent 2:")
student2.display_info()

In [None]:
student1.name, student1.age, student1.grade

In [None]:
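# This raises an AttributeError: logage only exists inside display_info() and was never stored on self
student1.logage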

In [None]:
#  Sometimes, we want a class to inherit from another class. We will encounter this when we build neural networks with PyTorch.

class Person:
    """Example of the base class"""
    def __init__(self, name):
        self.name = name

    def get_name(self):
        """Get person name"""
        return self.name


class Employee(Person):
    def __init__(self, name, staff_id):
        super().__init__(name)
        self.staff_id = staff_id

    def get_full_id(self):
        """Get full employee id"""
        return self.get_name() + ', ' + str(self.staff_id)

In [None]:
emp = Employee(name='Bob', staff_id=10)

In [None]:
# Employee inherits the .get_name() method from the Person class!
emp.get_name()

In [None]:
# but we can still access the methods specific to the new class
emp.get_full_id()

# 4. Reading and Writing Files

Reading and writing files in Python is usually quite simple and has, like everything else, awesome documentation and community support.

There are a few approaches that are good to get your head around:

0. Filepaths
1. Pandas
2. NumPy
3. Pickle
4. Parquet
5. Base Python


### 4.0 Filepaths

Filepaths might seem simple, but they often cause a lot of trouble, in particular across operating systems.

Luckily, Python has some great tools to help you work with paths.
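
One such tool is the standard-library `pathlib` module. A minimal sketch, assuming the same `data/Corp_Folketing.csv` layout used in this notebook (the variable names are made up for illustration):

```python
import os
from pathlib import Path

# Build a path from its parts; pathlib inserts the right separator for the OS
rel_path = Path('data') / 'Corp_Folketing.csv'

# A relative path is resolved against the current working directory
abs_path = Path.cwd() / rel_path

print(rel_path.is_absolute())
print(abs_path.is_absolute())
print(rel_path.name, rel_path.suffix)

# os.path.join does the same job with plain strings
joined = os.path.join('data', 'Corp_Folketing.csv')
print(joined)
```

Building paths from parts like this means we never have to decide between / and \ ourselves.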

The first important thing to notice is the difference between *absolute* and *relative* filepaths.

In [None]:
# What is this? 
'C:/Users/au535365/Dropbox/teaching/css_fall2023/data/Corp_Folketing.csv'

In [None]:
# What about this?
'data/Corp_Folketing.csv'

When you open a Jupyter Notebook, you always start in the directory from which you opened the notebook.

We can check it directly using the `os` module and the `getcwd()` function, which is short for '**get** **c**urrent **w**orking **d**irectory'.

In [None]:
import os
os.getcwd()

We want to start from the outer folder of our project. In my case it is `css_fall2023`.

I change it by using the `os` module and the `chdir()` function, which is short for '**ch**ange **dir**ectory'.

It is good practice to do that at the beginning of each notebook you are working with!

In [None]:
os.chdir('C:/Users/au535365/Dropbox/teaching/css_fall2023/')

In [None]:
os.getcwd()

### 4.1 Pandas

In [None]:
# Reading using absolute path
fpath = 'C:/Users/au535365/Dropbox/teaching/css_fall2023/data/Corp_Folketing.csv'
df = pd.read_csv(fpath)

In [None]:
# Reading using relative path
fpath = 'data/Corp_Folketing.csv'
df = pd.read_csv(fpath)

In [None]:
# Reading with an extra / at the start: the path is now absolute (it starts at the root of the drive), so the file will most likely not be found
fpath = '/data/Corp_Folketing.csv'
df = pd.read_csv(fpath)

In [None]:
# Writing every 20th row to a new csv file
df[::20].to_csv('data/tester.csv', index=False)

### 4.2 NumPy

In [None]:
# Reading
from io import StringIO   # StringIO behaves like a file object

c = StringIO("0 1\n2 3")

a = np.loadtxt(c)

In [None]:
# Writing: savetxt stores plain text, save stores NumPy's binary .npy format
np.savetxt('data/nparray-txt.npy', a, delimiter=' ')
np.save('data/nparray-npy.npy', a)

In [None]:
# Reading
np.loadtxt('data/nparray-txt.npy')

In [None]:
# Reading 
np.load('data/nparray-npy.npy')

### 4.3 Pickle

In [None]:
import pickle

In [None]:
# Reading
# The 'rb' argument specifies that we are reading ('r') a binary ('b') file
fpath = 'data/diarization'
with open(fpath, 'rb') as f:
    data = pickle.load(f)
# No f.close() needed: the with-statement closes the file automatically

In [None]:
# Writing
# The 'wb' argument specifies that we are writing ('w') a binary ('b') file
fpath = 'data/diarization-new'
with open(fpath, 'wb') as f:
    pickle.dump(data, f)       # object in first pos, open file object in second pos
# No f.close() needed: the with-statement closes the file automatically

### 4.4 Parquet (Optional)

In [None]:
# Parquet is a very efficient way of saving data, but also a bit tricky. Play around if you feel like it.
# import pyarrow as pa
# import pyarrow.parquet as pq
# parquet_table = pa.Table.from_pandas(df)
# pq.write_table(parquet_table, '/home/rask/Dropbox/teaching/css_fall2023/data/Corp_Folketing.parquet')

### 4.5 Base Python (Optional)

In [None]:
# import csv
# fpath = '/home/rask/Dropbox/teaching/css_fall2023/data/tabseparated.txt'

# # Reading
# with open(fpath, newline = '') as games:                                                                                          
#     reader = csv.reader(games, delimiter='\t')
#     for read in reader:
#         print(read)

# You will never write a file like this, I assure you. Too tedious.