# Class 2: Coding Tour


### 1. Pandas
    
    1.1 Reading in a pandas dataframe
    1.2 Inspecting a dataframe
    1.3 Modifying columns
    1.4 Indexing and slicing
    1.5 Summary statistics
    1.6 Apply operations


### 2. NumPy

    2.1 Basic operations
    2.2 Joining arrays
    2.3 Multidimensional arrays
    2.4 Missing values


### 3. Classes


### 4. Reading and Writing Files

    4.1 Pandas
    4.2 NumPy
    4.3 Pickle
    4.4 Parquet
    4.5 Base Python
    


# 0. Setup

In [2]:
# SETUP

import os
import platform

# platform.system() works on every OS; os.uname() is not available on Windows
if platform.system() == 'Linux':
    base_dir = '/home/rask/'
else:
    base_dir = 'C:/Users/au535365/'

# 1. Pandas

In [3]:
import pandas as pd
import pyarrow.parquet as pq
parquet_file = os.path.join(base_dir, 'Dropbox/data/parlspeech/DK/ParlSpeech-V2/Corp_Folketing_V2.parquet')
csv_file = os.path.join(base_dir, 'Dropbox/research/stigma-paper/data/emotion/emotionality_scores.csv')

### 1.1 Reading and writing using Pandas

In [4]:
# Parquet file
# Use the `pq` alias from the import above; the bare name `pyarrow` is not defined here
df = pq.read_table(parquet_file)
df = df.to_pandas()

In [None]:
# CSV file
csv_df = pd.read_csv(csv_file)

In [None]:
# Also possible to read .dta files!
# pd.read_stata()

# And even .rds files using the pyreadr module!

In [None]:
# When writing a dataframe, call .to_csv on it and pass the target path
# df.to_csv('FILEPATH', index=False)
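
As a minimal sketch of the write/read round trip: the toy dataframe and the temporary file below are made up for illustration. `to_csv` takes the target path, and `index=False` keeps the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({"speaker": ["A", "B"], "terms": [120, 85]})

# Write to a temporary file so nothing on disk is overwritten
tmp_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
toy_df.to_csv(tmp_path, index=False)

# Read it back; the columns are the same as before writing
toy_back = pd.read_csv(tmp_path)
print(toy_back)
```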

### 1.2 Inspecting a dataframe

In [None]:
# Overview of what's in the dataframe:
# Number of rows
# Number of columns
# Type of columns

df.info()

In [None]:
# Get column names
df.columns

In [None]:
# See top n rows (see bottom n rows with .tail)
df.head(n=10)

In [None]:
# See unique values in column
df.date.unique().tolist()

### 1.3 Modifying columns

    - Renaming
    - Type casting

In [None]:
# Rename columns
df.rename(columns={'party.facts.id': 'party_id'}, inplace=True)

In [None]:
# Type casting of a column can be a bit complicated when it contains missing values, so I often use a workaround.
df['party_id'] = pd.Series([int(x) if pd.notnull(x) else None for x in df.party_id.tolist()], dtype=object)
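
An alternative sketch, assuming the column holds whole numbers plus missing values: pandas' nullable integer dtype `Int64` (capital I) keeps the integers as integers and stores missing entries as `pd.NA`. The toy series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ids stored as floats because of the missing value
party_id = pd.Series([1.0, 2.0, np.nan, 4.0])

# Cast to the nullable integer dtype; NaN becomes pd.NA instead of raising an error
party_id_int = party_id.astype("Int64")
print(party_id_int.dtype)
print(party_id_int.isna().tolist())
```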

### 1.4 Indexing and slicing

The most fundamental action in data analysis is the ability to “select” elements within your dataset.

Pandas indexing and slicing is a little bit tedious I think. 

Two ways:
* `.loc`  (label indexing)
* `.iloc` (integer indexing)

Usually, I prefer `.loc` since it is more explicit.

Often, we want to select specific rows, columns or even some specific cells.
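
To make the difference between the two concrete, here is a small sketch on a hypothetical toy dataframe whose index labels are deliberately not 0, 1, 2 (as happens after filtering rows):

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {"party": ["S", "V", "S"], "speaker": ["A", "B", "C"]},
    index=[10, 20, 30],
)

# .loc selects by label: the row whose index label is 20
by_label = toy.loc[20, "speaker"]

# .iloc selects by position: the second row, second column
by_position = toy.iloc[1, 1]

# .loc slices include the end label; .iloc slices exclude the end position
loc_slice = toy.loc[10:20]
iloc_slice = toy.iloc[0:1]
print(by_label, by_position, len(loc_slice), len(iloc_slice))
```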

In [None]:
# Subset a dataframe to include speeches given by legislators from S
df_s = df.loc[df['party'] == 'S']
df_s

In [None]:
# Extract first row with .loc
# What happens here?
df_s.loc[0]

In [None]:
# Extract first row with .iloc
# What happens here?
df_s.iloc[0]

In [None]:
# To enable using df_s.loc[0,:], we need to reset indices. Use with care!
df_s.reset_index(drop=True, inplace=True)
df_s.loc[0]

In [None]:
# Get speaker value for the row labelled 10 (the 11th row after the reset) using .loc
df_s.loc[10, 'speaker']

In [None]:
# This raises an error since .iloc only accepts integer positions, not column names
df_s.iloc[10, 'speaker']

In [None]:
# Slice a dataframe from label 0 up to and including label 20, taking every second row (.loc slices include the end label).
df_s.loc[0:20:2, ['party', 'speaker']]

In [None]:
# Keep only rows with chair == False (i.e. remove speeches given by the chair)
df = df.loc[df['chair'] == False]

In [None]:
# Keep only speeches given by legislators from certain parties
red_parties = ['S', 'EL', 'SF', 'RV', 'ALT']
blue_parties = ['V', 'DF', 'KF', 'LA']

df = df.loc[df['party'].isin(red_parties + blue_parties)]
df

In [None]:
# Reset indices
df.reset_index(drop=True, inplace=True)

### 1.5 Summary statistics

In [None]:
# Count values by party
df['party'].value_counts()

In [None]:
# Count values by chair - note the different way of calling a column
df.chair.value_counts()

In [None]:
# Summary stats - only applicable to numerical values
df['terms'].describe()

### 1.6 Apply

The `apply()` method belongs to pandas Series and DataFrames and is very effective for applying a function on an element-wise basis.

If we, for instance, want to convert all values in a column to 1 if X and to 0 if Y, this can be done using `apply()`.
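
A minimal sketch of such a recoding, using a hypothetical series of party labels and an assumed grouping into "red" parties:

```python
import pandas as pd

# Hypothetical party labels
party = pd.Series(["S", "V", "SF", "DF"])
red = ["S", "SF"]  # assumed grouping, only for this example

# apply calls the function once per element of the series
is_red = party.apply(lambda p: 1 if p in red else 0)
print(is_red.tolist())
```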




In [None]:
# We want to extract the year from the date variable.
df['date'].apply(lambda x: x.split('-')[0])

In [None]:
# Can be directly assigned as a new column: 
df.loc[:, 'year'] = df['date'].apply(lambda x: x.split('-')[0])

In [None]:
# To see why it is necessary, consider this example. Here we are computing the length of the text column.
# Do you think the `len()` function returns the length element-wise? Well...
df['chars'] = len(df['text'])
print(df['chars'])

In [None]:
df.loc[:, 'chars'] = len(df['text'])

In [None]:
# We need the apply method!
df['chars'] = df['text'].apply(lambda x: len(x))
print(df['chars'])

In [None]:
# Another common pattern is group-wise operations: group the dataframe by one or more columns and compute something per group.
terms_df = df.groupby(['party'])['terms'].describe()

In [None]:
# Simple clean-up
terms_df['party'] = terms_df.index.tolist()
terms_df.reset_index(drop=True, inplace=True)

# Now we can see which parties speak more often. 
terms_df

In [None]:
df.groupby(['party'])['terms'].describe()

In [None]:
grouped_df = df.groupby(['party', 'year'])['terms'].describe().reset_index()
grouped_df

In [None]:
# Extract only the mean
grouped_df = df.groupby(['party', 'year'])['terms'].mean().reset_index()
grouped_df

In [None]:
# We need to make sure that Python interprets the data correctly (year is a string after .split).
# grouped_df['year'] = grouped_df['year'].astype(int)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()

In [None]:
# Plot the mean number of words for each party over time
fig, ax = plt.subplots()
for k, v in grouped_df.groupby('party'):
    v.plot(x='year', y='terms', label=k, ax=ax, marker='o', markersize=3)
plt.xlabel('Year', fontsize=10)
plt.ylabel('Average number of words', fontsize=10)
plt.xticks(size=10)
plt.yticks(size=10)
plt.legend(frameon=False, fontsize=10)
plt.show()

# 2. NumPy 

Much of the scientific Python stack, pandas included, builds on NumPy, short for **Numerical Python**.

Often, you will have a Pandas dataframe that you want to pass on to NumPy.

NumPy is very efficient because operations on whole arrays run in optimized, compiled code rather than in Python loops.
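
A small sketch of handing a dataframe over to NumPy with `.to_numpy()`; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric dataframe
toy = pd.DataFrame({"terms": [100, 250, 75], "chars": [600, 1400, 410]})

# .to_numpy() hands the values to NumPy as a 2d array (rows x columns)
arr = toy.to_numpy()

# Vectorised arithmetic: one expression operates on every element
chars_per_term = arr[:, 1] / arr[:, 0]
print(arr.shape)
print(chars_per_term)
```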

In [4]:
import numpy as np

### 2.1 Basic operations

In [262]:
# Create 1d array
myarr0 = np.arange(10)
myarr0

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]:
# Identical in values to:
mylist0 = list(range(10))

In [31]:
%time for _ in range(10): myarr1 = myarr0 * 2

CPU times: user 37 µs, sys: 28 µs, total: 65 µs
Wall time: 70.6 µs


In [18]:
%time for _ in range(10): mylist1 = [x * 2 for x in mylist0]

CPU times: user 10 µs, sys: 8 µs, total: 18 µs
Wall time: 22.9 µs
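
With only ten elements, the fixed overhead of a NumPy call likely dominates, so the timings above say little about larger data. A rough sketch with a bigger input follows; the size of one million is an arbitrary choice and timings vary by machine, but the vectorised version typically pulls ahead.

```python
import timeit

import numpy as np

n = 1_000_000  # arbitrary size, chosen only for illustration
big_arr = np.arange(n)
big_list = list(range(n))

# Time a handful of repetitions of each approach
t_numpy = timeit.timeit(lambda: big_arr * 2, number=5)
t_list = timeit.timeit(lambda: [x * 2 for x in big_list], number=5)
print(f"NumPy: {t_numpy:.4f} s, list comprehension: {t_list:.4f} s")

# Both approaches produce the same values
same = (big_arr * 2).tolist() == [x * 2 for x in big_list]
```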


In [55]:
# What's the shape of the array? For 2d arrays the syntax is (row, col); a 1d array has a single entry
myarr1.shape

(10,)

In [264]:
# Indexing works just like for lists - get first two elements like this:
myarr0[:2]

array([0, 1])

In [39]:
# Filtering a 1d array
myarr1[myarr1 < 10]

array([0, 2, 4, 6, 8])

In [202]:
# Filtering numbers by multiple conditions
myarr1[(myarr1 >= 5) & (myarr1 <= 10) | (myarr1 < 2)]

array([ 0,  6,  8, 10])

In [40]:
# Replacing values in a 1d array
myarr1[myarr1 < 10] = 0

In [74]:
# Reshaping an array (the new shape must hold all 10 elements)
myarr1.reshape(2, 5)

# reshape returns a new array; myarr1 itself is still one-dimensional
myarr1.shape

(10,)

In [72]:
# Setting the col to -1 means that NumPy will figure it out
myarr2 = myarr1.reshape(2, -1)
myarr2.shape

(2, 5)

In [206]:
# We can automatically increase the dimension of our array by using np.newaxis
print(f"Dimensions of original array: {myarr1.shape}")
print(f"Dimensions of modified array: {myarr1[:, np.newaxis].shape}")

Dimensions of original array: (10,)
Dimensions of modified array: (10, 1)


In [208]:
# Extract indices where two arrays match

a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

np.where(a == b)[0]

array([1, 3, 5, 7])

In [86]:
myarr0, myarr1

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18]))

In [101]:
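# Find maximum value element-wise
a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])

# Pure-Python version; separate loop names avoid shadowing a and b
# (np.maximum(a, b) gives the same values as an array)
[max(x, y) for x, y in zip(a, b)]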

[6, 7, 9, 8, 9, 7, 5]

In [104]:
# Find index of max and min value
np.argmax(a), np.argmin(b)

(2, 6)

In [105]:
# Stats
np.mean(a), np.std(a), np.max(a), np.min(a)

(6.285714285714286, 1.6659862556700857, 9, 4)

In [209]:
# Find unique values
a = np.array([0,0,1,1])
np.unique(a)

array([0, 1])

In [211]:
# Many numpy operations also work on lists!
np.unique([0,0,1, 1])

array([0, 1])

In [212]:
# Sort an array
a = np.array([1,2,9,5,4])
np.sort(a)

array([1, 2, 4, 5, 9])

In [215]:
# Get indices which will sort an array
np.argsort(a)

array([0, 1, 4, 3, 2])

In [None]:
a[np.argsort(a)]

In [None]:
# Save an array like this (NumPy appends the .npy extension):
np.save('some_array', a)

### 2.2 Joining arrays

In [226]:
# Vertical stacking of arrays
a = np.arange(10)
b = np.repeat(1, 10)

m = np.vstack([a,b])
print(f"Array: \n{m}\n\n")
print(f"Shape of array: {m.shape}")

Array: 
[[0 1 2 3 4 5 6 7 8 9]
 [1 1 1 1 1 1 1 1 1 1]]


Shape of array: (2, 10)


In [70]:
a, b = a.reshape(2,5), b.reshape(2,5)

m = np.vstack([a,b])
print(f"Array: \n{m}\n\n")
print(f"Shape of array: {m.shape}")

Array: 
[[0 1 2 3 4]
 [5 6 7 8 9]
 [1 1 1 1 1]
 [1 1 1 1 1]]


Shape of array: (4, 5)


In [224]:
# Horizontal stacking - basically concatenation
# Solution:
a = np.arange(10)
b = np.repeat(1, 10)

m = np.hstack([a,b])
print(f"Array: \n{m}\n\n")
print(f"Shape of array: {m.shape}")

Array: 
[0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1]


Shape of array: (20,)


In [221]:
# Concatenation: row-wise
print(np.concatenate([a, b], axis=0))

[0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1]


In [230]:
# Concatenation: column-wise
np.concatenate([a, b], axis=1)

# Not possible due to 1d arrays!

AxisError: axis 1 is out of bounds for array of dimension 1

In [241]:
a, b = a.reshape(2,5), b.reshape(2,5)
np.concatenate([a, b], axis=1)

array([[0, 1, 2, 3, 4, 1, 1, 1, 1, 1],
       [5, 6, 7, 8, 9, 1, 1, 1, 1, 1]])

### 2.3 N-dimensional arrays (matrices)

The real power of NumPy comes from the ability to handle N dimensions and not just a single one. We have already seen that arrays can have more than one dimension.
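
One place where the extra dimensions matter is the `axis` argument, which picks the dimension an operation runs along. A minimal sketch on a made-up 2x3 matrix:

```python
import numpy as np

# A small 2x3 matrix
m = np.array([[1, 2, 3],
              [4, 5, 6]])

# axis=0 collapses the rows: one result per column
col_sums = m.sum(axis=0)

# axis=1 collapses the columns: one result per row
row_sums = m.sum(axis=1)
print(col_sums, row_sums)
```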

In [106]:
# Create a matrix of dimension 3x3 of booleans. 

# One solution:
m0 = np.ones((3,3), dtype=bool)

# Another solution
m1 = np.ones((9), dtype=bool).reshape(3,3)

m0

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

In [157]:
# Generate 2x3 matrix with random numbers
array = np.random.randn(2, 3)

In [158]:
# Subsetting row 1, column 1 (NB: Zero-indexing!!!)
array[1,1] == array[1][1]

True

In [159]:
# Indexing rows from 1 onward, and cols from 1 to the last col
array[1:,1:]

array([[1.176514  , 0.53616497]])

In [160]:
# Reverse rows
array[::-1 , :]

array([[ 0.42958903,  1.176514  ,  0.53616497],
       [ 0.81578153, -0.92932608,  0.18363494]])

In [161]:
# Reverse cols
array[: , ::-1]

array([[ 0.18363494, -0.92932608,  0.81578153],
       [ 0.53616497,  1.176514  ,  0.42958903]])

In [259]:
multiarray = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
multiarray.shape            # 2 arrays, each with dimension 2x3

(2, 2, 3)

In [261]:
multiarray = np.array([[[1, 2], [4, 5]], [[7, 8], [10, 11]], [[9, 10], [10, 10]]])
multiarray.shape            # 3 arrays, each with dimension 2x2

(3, 2, 2)

### 2.4 Missing values

In [185]:
# Find missing values using np.isnan
missing_array = np.array([1,2,2, np.nan, np.nan, 3])
missing_bool = np.isnan(missing_array)
missing_bool

array([False, False, False,  True,  True, False])

In [176]:
# Combine with np.where to get indices
np.where(missing_bool)[0]

array([3, 4])

In [184]:
# np.isnan also works on N-dimensional arrays
missing_array0 = np.array([1,2,2, np.nan, np.nan, 3])
missing_array1 = np.array([1,np.nan,2, 3, np.nan, 3])
np.isnan(np.vstack([missing_array0, missing_array1]))

array([[False, False, False,  True,  True, False],
       [False,  True, False, False,  True, False]])
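
Once the missing values are located, a common next step is computing statistics without them. A short sketch, reusing the same example array: `np.nanmean` skips NaN entries, and a boolean mask drops them.

```python
import numpy as np

missing_array = np.array([1, 2, 2, np.nan, np.nan, 3])

# An ordinary mean propagates the missing values
plain_mean = np.mean(missing_array)

# np.nanmean ignores NaN entries
nan_mean = np.nanmean(missing_array)

# Alternatively, drop the NaN entries with a boolean mask
observed = missing_array[~np.isnan(missing_array)]
print(plain_mean, nan_mean, observed)
```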

# 3. Classes 

Classes are a fundamental concept in object-oriented programming, and they allow you to create custom data types with attributes (variables) and methods (functions) associated with them. 

You will rarely need to make classes yourself, but nearly everything you use is an instance of a class!
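
A quick way to see this is to ask a few familiar objects for their class with `type()`. The objects below are arbitrary examples.

```python
import numpy as np
import pandas as pd

# Every object is an instance of some class; type() shows which one
objects = [3, "text", [1, 2], np.arange(3), pd.DataFrame()]
for obj in objects:
    print(type(obj).__name__)

# Methods such as .head() or .reshape() are functions defined on those classes
is_frame = isinstance(pd.DataFrame(), pd.DataFrame)
```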

In [274]:
# Define a class named "Student" - we use the 'class' keyword
class Student:
    
    # Constructor method (__init__) initializes the object's attributes
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        self.grade = grade

    # Method to display student information
    def display_info(self):
        logage = np.log(self.age)
        print(f"Name: {self.name}, Age: {self.age}, Grade: {self.grade}")

In [272]:
# Create instances (objects) of the Student class
student1 = Student("Alice", 18, "A")
student2 = Student("Bob", 17, "B")

# Accessing attributes and calling methods of the objects
print("Student 1:")
student1.display_info()  

print("\nStudent 2:")
student2.display_info()

Student 1:
Name: Alice, Age: 18, Grade: A

Student 2:
Name: Bob, Age: 17, Grade: B


In [270]:
student1.name, student1.age, student1.grade

('Alice', 18, 'A')

In [275]:
# logage is only a local variable inside display_info (never stored on self), so this fails
student1.logage

AttributeError: 'Student' object has no attribute 'logage'

In [295]:
#  Sometimes, we want a class to inherit from another class. We will encounter this when we build neural networks with PyTorch.

class Person:
    """Example of the base class"""
    def __init__(self, name):
        self.name = name

    def get_name(self):
        """Get person name"""
        return self.name


class Employee(Person):
    def __init__(self, name, staff_id):
        super().__init__(name)
        self.staff_id = staff_id

    def get_full_id(self):
        """Get full employee id"""
        return self.get_name() + ', ' + str(self.staff_id)

In [297]:
emp = Employee(name='Bob', staff_id=10)

In [286]:
# Employee inherits the .get_name() method from the Person class!
emp.get_name()

'Bob'

In [300]:
# but we can still access the methods specific to the new class
emp.get_full_id()

'Bob, 10'

# 4. Files 

Reading and writing files in Python is usually quite simple and has, like everything else, awesome documentation and community support.

There are a few ways that are good to get familiar with:

1. Pandas
2. NumPy
3. Pickle
4. Parquet
5. Base Python


### 4.1 Pandas

In [3]:
# Reading
csv_file = os.path.join(base_dir, 'Dropbox/research/stigma-paper/data/emotion/emotionality_scores.csv')
df = pd.read_csv(csv_file)

In [4]:
# Writing
df.to_csv('/home/rask/Dropbox/teaching/css_fall2023/data/emotionality_scores-new.csv', index=False)

### 4.2 NumPy

In [21]:
# Reading
from io import StringIO   # StringIO behaves like a file object

c = StringIO("0 1\n2 3")

a = np.loadtxt(c)

In [None]:
# Writing
np.savetxt('FILEPATH', a, delimiter=' ')   # save the array a, not the StringIO object c

### 4.3 Pickle

In [8]:
import pickle

In [12]:
# Reading
# The 'rb' argument opens the file for reading in binary mode
fpath = '/home/rask/Dropbox/teaching/css_fall2023/data/diarization'
with open(fpath, 'rb') as f:
    data = pickle.load(f)      # the with block closes the file automatically

In [14]:
# Writing
# The 'wb' argument opens the file for writing in binary mode
fpath = '/home/rask/Dropbox/teaching/css_fall2023/data/diarization-new'
with open(fpath, 'wb') as f:
    pickle.dump(data, f)       # object in first pos, open file object in second pos

### 4.4 Parquet

In [16]:
import pyarrow as pa
import pyarrow.parquet as pq   # the parquet submodule must be imported explicitly

In [17]:
parquet_table = pa.Table.from_pandas(df)

In [20]:
pq.write_table(parquet_table, '/home/rask/Dropbox/teaching/css_fall2023/data/emotionality_scores-new.parquet')

### 4.5 Base Python

In [43]:
import csv
fpath = '/home/rask/Dropbox/teaching/css_fall2023/data/tabseparated.txt'

# Reading
with open(fpath, newline='') as games:
    reader = csv.reader(games, delimiter='\t')
    for read in reader:
        print(read)

# You will never write a file like this, I assure you. Too tedious.

['Mette Frederiksen, Nicolai Wammen']
['Lars Løkke Rasmussen, Jon Stephensen']
['Jakob Ellemann-Jensen, Stephanie Lose']
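
Should it ever be needed, here is a minimal sketch of writing with the base `csv` module. The rows and the temporary file are hypothetical and only serve the illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows, written to a temporary file
rows = [["name", "party"], ["Alice", "S"], ["Bob", "V"]]
tmp_path = os.path.join(tempfile.mkdtemp(), "toy.tsv")

# Writing: newline='' lets the csv module control line endings
with open(tmp_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)

# Reading the file back gives one list per row
with open(tmp_path, newline="", encoding="utf-8") as f:
    rows_back = list(csv.reader(f, delimiter="\t"))
print(rows_back)
```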
