<font style='font-size:1.5em'>**🧑‍🏫 Week 02 Lecture – A Python Crash Course** </font>

<font style='font-size:1.2em'>LSE DS105A – Data for Data Science (2024/25)</font>

<div style="color: #333333; background-color:rgba(93, 158, 188, 0.15); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px; margin: 10px; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

🗓️ **DATE:** 10 October 2024

⌚ **TIME:** 16.00-18.00

📍 **LOCATION:** CLM.5.02
</div>


**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: Showcase the basic concepts of the Python programming language.

**REFERENCES:**

- 📘 [Python for Data Science book](https://drbeane.github.io/python/about.html) By Robbie Beane

---

In [19]:
import os
import numpy as np
import pandas as pd

from pprint import pprint

from lets_plot import *

LetsPlot.setup_html()

---

If the cell above throws an error when you run it, it's because you need to install additional Python libraries.

In that case, go to the menu and click "Terminal" -> "New Terminal". Then, in the terminal, run:

```bash
pip install numpy pandas lets-plot
```

Wait for it to complete, then come back here (you can close the Terminal window), click "Restart" at the top of this notebook and try again.

⭐ Pro-Tip: Alternatively, you can run Terminal commands from here! Open a new Python cell below and start the line with `!`, like this:

```bash
! pwd
```

---

# PART 1: Python Essentials

👉 You can always read the 🧑‍🏫 [W02 Lecture Notes](https://moodle.lse.ac.uk/mod/page/view.php?id=1533983#lecture-notes) to revisit the concepts later.

## 1. Recap the primitive data types

In [23]:
# This is a string
student_name = "John Doe"

# All of those below are integers (but could be floats)

w06_summative = 65
w09_summative = 63

w11_group_presentation = 80
final_project = 68
project_individ_contribution = 75

# This is a float (Watch as I format the line below for better readability)
final_grade = (
    0.2 * w06_summative + 
    0.3 * w09_summative + 
    0.1 * w11_group_presentation + 
    0.3 * final_project + 
    0.1 * project_individ_contribution)

**Data type questions:**

---

👉 What would be the data type of the operation below? (integer, float, string?)

```python
w06_summative * 2
```

---

👉 And what if I did it this way?

```python
w06_summative * 2.0
```

---

👉 What about this:

```python
w06_summative * "2"
```

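(The surprising result above happens because `*` between an integer and a string repeats the string rather than multiplying two numbers. For how Python handles mixed types more generally, see [type coercion](https://drbeane.github.io/python/pages/data_types/coercion.html).)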
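To check your answers to the three questions above, here is a minimal sketch that inspects each result with `type()`. The value 65 is copied from the earlier cell; the names `int_result`, `float_result` and `str_result` are only for illustration.

```python
w06_summative = 65  # same value as in the cell above

int_result = w06_summative * 2      # int * int stays an int
float_result = w06_summative * 2.0  # int * float becomes a float
str_result = w06_summative * "2"    # int * str repeats the string 65 times

print(type(int_result).__name__, int_result)
print(type(float_result).__name__, float_result)
print(type(str_result).__name__, len(str_result))
```

Printing the length of `str_result` instead of the string itself keeps the output short.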


In [None]:
# I will use this cell as a test ground for the questions above.

## 2. Lists and dictionaries

**Data:** The real data is in a CSV file, but for simplicity, I just pasted it here manually.

The two lists contain the real final grades of DS105A & DS105W students last year.

In [43]:
DS105A_grades = [
    64.30,
    45.75,
    83.75,
    82.75,
    79.75,
    79.35,
    84.25,
    80.65,
    74.55,
    80.60,
    80.40,
    76.50,
    75.60,
    81.35,
    76.85,
    40.75,
    75.20,
    70.60,
    68.00,
    70.05,
    66.15,
    64.25,
    63.65,
    77.20,
    75.70,
    75.20,
    70.50,
    56.50,
    77.10,
    76.50,
    75.00,
    81.35,
    77.95,
    62.85,
    78.40,
    76.25,
    73.00,
    76.80,
    73.60,
    69.00,
    73.20,
    63.40,
    60.65,
    71.40,
    70.80,
    65.40,
    78.70,
    75.60,
    67.90,
    43.50,
    75.30,
    69.00,
    69.00,
    88.80,
    86.10,
    83.50,
    80.60,
    70.55,
    62.55,
    61.45,
    59.90,
    24.80,
    10.00,
    10.00,
    0.00,
    0.00,
    0.00,
]

DS105W_grades = [
    85.10,
    79.50,
    79.00,
    78.35,
    78.20,
    78.10,
    77.50,
    77.45,
    77.45,
    76.70,
    76.50,
    76.35,
    76.35,
    75.80,
    75.45,
    75.10,
    74.30,
    74.10,
    74.05,
    73.70,
    73.45,
    73.35,
    73.00,
    72.45,
    72.45,
    71.95,
    71.25,
    71.20,
    70.45,
    70.30,
    70.20,
    67.25,
    69.20,
    68.45,
    67.00,
    66.40,
    64.20,
    63.65,
    60.15,
    62.30,
    62.25,
    61.90,
    59.45,
    59.30,
    58.60,
    58.45,
    58.35,
    56.55,
    52.30,
    50.75,
    25.50,
]

How do we navigate a list? How do I check the length of a list?


[64.3, 45.75, 83.75, 82.75, 79.75, 79.35, 84.25, 80.65, 74.55, 80.6, 80.4, 76.5, 75.6, 81.35, 76.85, 40.75, 75.2, 70.6, 68.0, 70.05, 66.15, 64.25, 63.65, 77.2, 75.7, 75.2, 70.5, 56.5, 77.1, 76.5, 75.0, 81.35, 77.95, 62.85, 78.4, 76.25, 73.0, 76.8, 73.6, 69.0, 73.2, 63.4, 60.65, 71.4, 70.8, 65.4, 78.7, 75.6, 67.9, 43.5, 75.3, 69.0, 69.0, 88.8, 86.1, 83.5, 80.6, 70.55, 62.55, 61.45, 59.9, 24.8, 10.0, 10.0, 0.0, 0.0, 0.0]
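As a small sketch of how to navigate a list, the example below uses only the first five DS105A grades (the name `sample_grades` is made up for this illustration). Indices start at 0, negative indices count from the end, and `len()` returns the number of elements.

```python
# First five values of DS105A_grades, copied by hand for illustration
sample_grades = [64.30, 45.75, 83.75, 82.75, 79.75]

first_grade = sample_grades[0]     # indexing starts at 0
last_grade = sample_grades[-1]     # -1 means "the last element"
first_three = sample_grades[:3]    # slicing: indices 0, 1 and 2
how_many = len(sample_grades)      # number of elements in the list

print(first_grade, last_grade, first_three, how_many)
```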


### 2.1 One-word list operations:


**Q:** What were the highest grades in DS105A and DS105W last year?

In [51]:
max(DS105A_grades)
max_DS105W_grades = max(DS105W_grades)

In [50]:
print(max(DS105A_grades))
print(max(DS105W_grades))

88.8
85.1


<span style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:1.5em;font-size:0.85em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;width:40%;">💭 <strong>Think about it:</strong> Why do I see only one result on the first cell?</span>

In [None]:
# What about sum() and len()?

**Q:** What if I had forgotten to add the grade of one student in DS105A?

In [60]:
# In other words, how do I add an element to a list?
DS105A_grades.append([35])

In [62]:
fruits = ['apple', 'banana']

fruits.append('kiwi')

In [64]:
fruits.append(['peach'])

In [68]:
fruits[:1] + ['peach'] + fruits[1:]

['apple', 'peach', 'banana', 'kiwi']

**Q:** Oops, my bad, that student's grade was already there. Let's remove it.


In [None]:
# I will use this cell to show you the power of pop()

<span style="color: #000000; font-size: 0.95em;background-color: #F6C73944; padding: 0.5em; border-radius: 5px; text-align: left;display:inline-block;margin-left:2em;">⚠️ Be careful: <code style="background-color: #f6f6f6;padding: .2em;font-size: 0.875em;color: #9753b8;white-space: pre-wrap;border-radius: .25rem;word-wrap: break-word;font-family:'SFMono-Regular','Menlo';direction: ltr;unicode-bidi: bidi-override;">list.pop()</code> removes the last element by default. If you want to remove a specific element, you need to specify its **index**.</span>

<span style="display:block;background-color:rgba(93, 158, 188, 0.1);color:#000;padding:0.75em;font-size:0.85em;margin-left:2em;margin-bottom:1em;border-radius:0.5em;width:40%;">📋 <strong>Take notes:</strong> Pay close attention as I mention the 'reproducibility issue' here.</span>
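Here is a minimal sketch of `append()` and `pop()` on the small `fruits` list, so the grades are left alone. It shows that `append()` adds its argument as a single element (so appending a list creates a nested list) and that `pop()` accepts an optional index.

```python
fruits = ['apple', 'banana', 'kiwi']

# append() adds its argument as ONE element, so a list ends up nested
fruits.append(['peach'])
print(fruits)

# pop() with no argument removes and returns the last element
removed_last = fruits.pop()

# pop(0) removes and returns the element at index 0
removed_first = fruits.pop(0)

print(removed_last, removed_first, fruits)
```

Note that running the two `pop()` lines a second time would remove further elements, so the final list depends on how many times the cell is executed.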


**Q:** Why not just use one single list to store all the grades? 

In [71]:
DS105_grades = DS105A_grades + DS105W_grades

In [73]:
['apple', 'banana'] + ['kiwi', 'peach']

['apple', 'banana', 'kiwi', 'peach']

In [72]:
len(DS105_grades)

118

I no longer need the two lists above, so I will delete them.

In [74]:
del DS105A_grades, DS105W_grades

### 2.2 Posing questions to the data

**Q:** What was the average grade in DS105 last year?

In [76]:
mean_DS105 = sum(DS105_grades) / len(DS105_grades)

# In pure python, without numpy:
std_DS105 = (sum((x - mean_DS105) ** 2 for x in DS105_grades) / len(DS105_grades)) ** 0.5

print("DS105:",  mean_DS105, "(+/-", std_DS105, ")")

DS105: 67.46737288135594 (+/- 17.000698361925107 )
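As a cross-check of the manual formulas, the sketch below compares them with Python's built-in `statistics` module on a small hand-copied sample (`sample_grades` is a made-up name). The manual formula divides by the number of elements, which corresponds to `statistics.pstdev` (population standard deviation), not `statistics.stdev` (which divides by N - 1).

```python
import statistics

# First five values of DS105A_grades, copied by hand for illustration
sample_grades = [64.30, 45.75, 83.75, 82.75, 79.75]

# Manual calculation, same formulas as in the cell above
mean_manual = sum(sample_grades) / len(sample_grades)
std_manual = (sum((x - mean_manual) ** 2 for x in sample_grades) / len(sample_grades)) ** 0.5

# Standard library equivalents
mean_stdlib = statistics.mean(sample_grades)
std_population = statistics.pstdev(sample_grades)  # divides by N
std_sample = statistics.stdev(sample_grades)       # divides by N - 1

print(mean_manual, mean_stdlib)
print(std_manual, std_population, std_sample)
```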


<span style="display:block;background-color:rgba(93, 158, 188, 0.1);color:#000;padding:0.75em;font-size:0.85em;margin-left:2em;margin-bottom:1em;border-radius:0.5em;width:40%;">📋 <strong>Take notes:</strong> Now let me show you a few string formatting tricks with Python [**f-strings**](https://www.w3schools.com/python/python_string_formatting.asp):</span>

In [83]:
msg = f"DS105 grades distribution: {mean_DS105:.2f} (+-{std_DS105:.2f})"
msg

'DS105 grades distribution: 67.47 (+-17.00)'
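A few more format specifiers that work inside f-strings, sketched with the two values printed earlier (the variable names here are illustrative, and the `=` specifier assumes Python 3.8 or newer):

```python
mean_grade = 67.46737288135594
std_grade = 17.000698361925107

rounded = f"{mean_grade:.1f}"     # one decimal place
padded = f"{mean_grade:>10.2f}"   # right-aligned in a field of width 10
share = f"{0.4576:.1%}"           # a proportion shown as a percentage
big_number = f"{1234567:,}"       # thousands separator
debug = f"{std_grade=:.2f}"       # variable name plus formatted value

print(rounded)
print(padded)
print(share)
print(big_number)
print(debug)
```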

Here's a distribution of the grades in DS105 last year (I don't expect you to understand the code below, but I want you to see the output): 

In [88]:
plot_df = (
    pd.DataFrame({'grade': DS105_grades})
    .assign(classification=lambda df: pd.cut(df['grade'],
                                             bins=[0, 39, 49, 59, 69, 100],
                                             include_lowest=True,  # otherwise grades of exactly 0 fall outside the first bin
                                             labels=['Fail (0-39)', 'Third (40-49)', '2:2 (50-59)', '2:1 (60-69)', 'First (70-100)']))
    .groupby('classification', observed=False)
    .size()
    .reset_index(name='count')
    .assign(percentage=lambda df: df['count'] / df['count'].sum() * 100,
            percentage_str = lambda df: df.apply(lambda x: f'{x["count"]}\n({x["percentage"]:.2f}%)', axis=1))
)

g = (
    ggplot(plot_df, aes(x='classification', y='percentage', fill='classification')) +
    geom_bar(stat='identity') +
    geom_text(aes(label='percentage_str', color='classification'), fill='white', size=9, nudge_y=9) +
    coord_flip() +
    scale_fill_discrete(name='Classification') +
    scale_color_discrete(name='Classification') +
    scale_y_continuous(breaks=list(range(0, 101, 25)), limits=(0, 100), name="% of students") +
    ggsize(1000, 475) +
    labs(title="Examiners keep telling me that I must reduce the number of firsts!",
         subtitle="But if you keep producing innovative and creative work, I can't help it!",
         caption="Distribution of grades for DS105 (2023/24)",
         x="") +
    theme(axis_text_x=element_text(size=20),
          axis_text_y=element_text(size=16),
          plot_title=element_text(size=19, face='bold'))
)

# This will only render when opened inside VS Code
g

In [None]:
# Save the plot as a PNG file
# ggsave(g, "DS105_2023_24_grades.png", path=".")
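If the pandas code above feels opaque, the same idea (count how many grades fall in each classification) can be sketched in plain Python with a function, a `for` loop and a dictionary. This is only an illustration: the `classify` function and `sample_grades` list are made up, and it assumes cut-offs at 40, 50, 60 and 70, which differ slightly from the bin edges passed to `pd.cut`.

```python
def classify(grade):
    """Return a classification label for a numeric grade (assumed cut-offs)."""
    if grade >= 70:
        return 'First'
    elif grade >= 60:
        return '2:1'
    elif grade >= 50:
        return '2:2'
    elif grade >= 40:
        return 'Third'
    else:
        return 'Fail'

# A handful of grades copied by hand for illustration
sample_grades = [64.30, 45.75, 83.75, 82.75, 79.75, 56.50, 24.80, 0.00]

# Count how many grades receive each label
counts = {}
for grade in sample_grades:
    label = classify(grade)
    counts[label] = counts.get(label, 0) + 1

for label, count in counts.items():
    print(f"{label}: {count} ({count / len(sample_grades):.1%})")
```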

---

# PART 2: Reading files, Python functions and `for` loops

🔔 You can always read the 🧑‍🏫 [W02 Lecture Notes](https://moodle.lse.ac.uk/mod/page/view.php?id=1533983#lecture-notes) to revisit Python concepts later.

Before the lecture, I downloaded all your "bash_history.csv" files and saved them in the `./data/` folder.

<details style="font-size:0.85em;margin-bottom: 1em;border:1px solid #aaa !important;border-radius: 4px !important;padding: .5em .5em 0.5em;"><summary>(Way too advanced) Click here to see the shell commands I used for that</summary>

When you clicked 'Hand In' here on Nuvolos, your submission was placed in a `W02-review` folder for me. Each student is assigned a random number so you're not identifiable.

The code below uses a command called `find` to search for the "bash_history.csv" files, then copies each one over to the `W02-Lecture/data/` folder under a different name.


```bash
find /files/W02-review/*/data -type f -name "bash_history.csv" | while read file; do cp "$file" "/files/W02-Lecture/data/bash_history_$(dirname $(dirname $file) | cut -d '_' -f 4).csv"; done
```

(The code won't work for you, as you won't have permission to view the `/files/W02-review/` folder)

</details>

In [None]:
# How many bash_history*.csv files do we have?

!ls ./data

## 3. Reading files

If I restart my machine, any **Python variables** I have created in the Python shell, or in this Jupyter Notebook, will be lost... forever (watch me as I do a little demo of that). **Data stored in files, on the other hand, persists**. That is why we like to store data in files.

On the 📝 [W02 Formative Exercise](https://moodle.lse.ac.uk/mod/page/view.php?id=1533681), we've shown you how to read a CSV file in "pure" Python (no external libraries):

In [89]:
with open('./data/bash_history_0db536d3.csv', 'r') as file:
    user_bash_history = file.read()

user_bash_history

'1 ,2024-10-09-11:28:38, python\n2 ,2024-10-09-11:37:16, ls\n3 ,2024-10-09-11:38:48, mkdir[data]\n4 ,2024-10-09-11:39:33, mkdir[code]\n5 ,2024-10-09-11:40:09, ls\n6 ,2024-10-09-11:40:47, cd[W02-Practice]\n7 ,2024-10-09-11:41:00, pwd\n8 ,2024-10-09-11:41:20, ls\n9 ,2024-10-09-11:42:46, mkdir[W02-Practice]\n10 ,2024-10-09-11:44:03, touch[code]\n11 ,2024-10-09-11:44:17, mkdir [code]\n12 ,2024-10-09-11:45:51, mv [code][W02-Practice]\n13 ,2024-10-09-11:46:36, cd [W02-Practice]\n14 ,2024-10-09-11:47:06, cd [\n15 ,2024-10-09-11:47:23, cd [W02-Practice]\n16 ,2024-10-09-14:00:44, history\n17 ,2024-10-09-14:06:32, ls\n18 ,2024-10-09-14:06:39, cd W02-Practice\n19 ,2024-10-09-14:06:46, rm -r data\n20 ,2024-10-09-14:06:48, ls\n21 ,2024-10-09-14:06:59, mkdir data\n22 ,2024-10-09-14:07:02, cd data\n23 ,2024-10-09-14:07:05, mkdir code\n24 ,2024-10-09-14:07:16, cd W02-Practice\n25 ,2024-10-09-14:08:48, cd data\n26 ,2024-10-09-14:09:42, cd ..\n27 ,2024-10-09-14:09:52, cd\n28 ,2024-10-09-14:10:26, cd W02

In [90]:
type(user_bash_history)

str

A CSV is a plain-text file, so its content is transparent to us. When we read this file into Python, we get it as a **string object**. Here, I chose to save it in the `user_bash_history` variable so I can reuse it.
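To try the write-then-read cycle without touching the real `./data/` folder, here is a self-contained sketch that uses a temporary folder. The file name `bash_history_example.csv` and its two lines are made up for illustration.

```python
import os
import tempfile

# Create a throwaway folder so the example does not touch ./data/
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'bash_history_example.csv')  # hypothetical file name

    # 'w' mode creates the file and writes text into it
    with open(csv_path, 'w') as file:
        file.write('1 ,2024-10-09-11:28:38, python\n2 ,2024-10-09-11:37:16, ls\n')

    # 'r' mode reads the whole file back as one single string
    with open(csv_path, 'r') as file:
        content = file.read()

print(type(content).__name__)
print(repr(content))
```

The `with` statement closes the file automatically at the end of each indented block, even if an error occurs inside it.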

## 4. Processing the string

You've seen that the symbol `\n` in a string represents a line break and that we can split this huge string into a list:

In [91]:
# I will override the user_bash_history variable
user_bash_history = user_bash_history.split('\n')
user_bash_history

['1 ,2024-10-09-11:28:38, python',
 '2 ,2024-10-09-11:37:16, ls',
 '3 ,2024-10-09-11:38:48, mkdir[data]',
 '4 ,2024-10-09-11:39:33, mkdir[code]',
 '5 ,2024-10-09-11:40:09, ls',
 '6 ,2024-10-09-11:40:47, cd[W02-Practice]',
 '7 ,2024-10-09-11:41:00, pwd',
 '8 ,2024-10-09-11:41:20, ls',
 '9 ,2024-10-09-11:42:46, mkdir[W02-Practice]',
 '10 ,2024-10-09-11:44:03, touch[code]',
 '11 ,2024-10-09-11:44:17, mkdir [code]',
 '12 ,2024-10-09-11:45:51, mv [code][W02-Practice]',
 '13 ,2024-10-09-11:46:36, cd [W02-Practice]',
 '14 ,2024-10-09-11:47:06, cd [',
 '15 ,2024-10-09-11:47:23, cd [W02-Practice]',
 '16 ,2024-10-09-14:00:44, history',
 '17 ,2024-10-09-14:06:32, ls',
 '18 ,2024-10-09-14:06:39, cd W02-Practice',
 '19 ,2024-10-09-14:06:46, rm -r data',
 '20 ,2024-10-09-14:06:48, ls',
 '21 ,2024-10-09-14:06:59, mkdir data',
 '22 ,2024-10-09-14:07:02, cd data',
 '23 ,2024-10-09-14:07:05, mkdir code',
 '24 ,2024-10-09-14:07:16, cd W02-Practice',
 '25 ,2024-10-09-14:08:48, cd data',
 '26 ,2024-10-09-1

Each element of this list is a string, and we can further split it:

In [94]:
user_bash_history[0].split(',')

['1 ', '2024-10-09-11:28:38', ' python']
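Notice the stray spaces in `'1 '` and `' python'`. One way to tidy them up, sketched below, is to call `.strip()` on each piece after splitting (the names `raw_line`, `fields` and `clean_fields` are illustrative):

```python
raw_line = '1 ,2024-10-09-11:28:38, python'

fields = raw_line.split(',')   # still carries the extra spaces

# strip() removes whitespace from both ends of a string
clean_fields = []
for field in fields:
    clean_fields.append(field.strip())

print(fields)
print(clean_fields)
```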

How do I repeat that for **all** elements in the list? That's the point of a `for` loop:

In [95]:
lines = []
for line in user_bash_history:
    split_line = line.split(',')
    lines.append(split_line)

In [96]:
len(lines)

30
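A word of caution about that count: when a text ends with `\n`, `split('\n')` produces one extra empty string at the end, so the list can be one element longer than the number of real lines. The sketch below, on a made-up two-line string, contrasts this with `splitlines()`, which does not add the trailing empty element. Whether this applies to a given file depends on how the file ends.

```python
text = '1 ,2024-10-09-11:28:38, python\n2 ,2024-10-09-11:37:16, ls\n'

with_split = text.split('\n')        # the final '\n' yields a trailing ''
with_splitlines = text.splitlines()  # no trailing empty string

print(len(with_split), with_split)
print(len(with_splitlines), with_splitlines)
```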

## 5. Adding more info to each line

**What if I wanted to add the username to each one of these lines?**

That is, how can I ensure that the first line goes from:

```output
['1 ', '2024-10-09-11:28:38', ' python']
```

to:

```output
['1 ', '2024-10-09-11:28:38', ' python', 'Jon']
```

?

I'd have to append to the end of the list:

In [None]:
line = user_bash_history[0].split(',') 
line.append('Jon')
line

There's a shorter way to do this. If I put `'Jon'` inside a list of its own, I can concatenate the two lists with the `+` operator:

In [None]:
user_bash_history[0].split(',') + ['Jon']

**How do I edit my `for` loop to accommodate this new idea?**

In [97]:
lines = []

for line in user_bash_history:
    lines.append(line.split(',') + ['Jon'])

lines

[['1 ', '2024-10-09-11:28:38', ' python', 'Jon'],
 ['2 ', '2024-10-09-11:37:16', ' ls', 'Jon'],
 ['3 ', '2024-10-09-11:38:48', ' mkdir[data]', 'Jon'],
 ['4 ', '2024-10-09-11:39:33', ' mkdir[code]', 'Jon'],
 ['5 ', '2024-10-09-11:40:09', ' ls', 'Jon'],
 ['6 ', '2024-10-09-11:40:47', ' cd[W02-Practice]', 'Jon'],
 ['7 ', '2024-10-09-11:41:00', ' pwd', 'Jon'],
 ['8 ', '2024-10-09-11:41:20', ' ls', 'Jon'],
 ['9 ', '2024-10-09-11:42:46', ' mkdir[W02-Practice]', 'Jon'],
 ['10 ', '2024-10-09-11:44:03', ' touch[code]', 'Jon'],
 ['11 ', '2024-10-09-11:44:17', ' mkdir [code]', 'Jon'],
 ['12 ', '2024-10-09-11:45:51', ' mv [code][W02-Practice]', 'Jon'],
 ['13 ', '2024-10-09-11:46:36', ' cd [W02-Practice]', 'Jon'],
 ['14 ', '2024-10-09-11:47:06', ' cd [', 'Jon'],
 ['15 ', '2024-10-09-11:47:23', ' cd [W02-Practice]', 'Jon'],
 ['16 ', '2024-10-09-14:00:44', ' history', 'Jon'],
 ['17 ', '2024-10-09-14:06:32', ' ls', 'Jon'],
 ['18 ', '2024-10-09-14:06:39', ' cd W02-Practice', 'Jon'],
 ['19 ', '2024-10-0

**I could convert this to a table (a DataFrame) later:**

(we will learn about data frames properly next week)

In [98]:
pd.DataFrame(lines, columns=['line_number', 'timestamp', 'command', 'username'])

|  | line_number | timestamp | command | username |
|---|---|---|---|---|
| 0 | 1.0 | 2024-10-09-11:28:38 | python | Jon |
| 1 | 2.0 | 2024-10-09-11:37:16 | ls | Jon |
| 2 | 3.0 | 2024-10-09-11:38:48 | mkdir[data] | Jon |
| 3 | 4.0 | 2024-10-09-11:39:33 | mkdir[code] | Jon |
| 4 | 5.0 | 2024-10-09-11:40:09 | ls | Jon |
| 5 | 6.0 | 2024-10-09-11:40:47 | cd[W02-Practice] | Jon |
| 6 | 7.0 | 2024-10-09-11:41:00 | pwd | Jon |
| 7 | 8.0 | 2024-10-09-11:41:20 | ls | Jon |
| 8 | 9.0 | 2024-10-09-11:42:46 | mkdir[W02-Practice] | Jon |
| 9 | 10.0 | 2024-10-09-11:44:03 | touch[code] | Jon |


## 6. Exercise: working with dictionaries

We can achieve the same result using a list of dictionaries instead of a list of lists.
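As a warm-up, here is a sketch of what a single line looks like as a dictionary. Each value gets a name (a key) instead of a position, so you read `record['command']` rather than remembering that the command lives at index 2. The sample line is copied from the output above; the names `raw_line`, `parts` and `record` are illustrative.

```python
raw_line = '12 ,2024-10-09-11:45:51, mv [code][W02-Practice]'
parts = raw_line.split(',')

record = {
    'line_number': int(parts[0]),               # int() tolerates the trailing space
    'timestamp': parts[1].strip(),
    'command': ','.join(parts[2:]).strip(),     # re-join in case the command itself had commas
    'username': 'Jon',
}

print(record['command'])    # access a value by its key
print(list(record.keys()))  # the keys, in insertion order
```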

<details style="background-color:white;color:white;height:1px"><summary>Hidden solution</summary>

```python
lines = []

for line in user_bash_history:
    line = line.split(',')

    if len(line) == 1:
        continue

    current_line = {
        'line_number'   :  int(line[0]),
        'timestamp'     :  line[1].strip(),
        'command'       :  ','.join(line[2:]).strip(),  # line[2:] is a list, so join it before stripping
        'username'      :  'Jon'
    }
    lines.append(current_line)

pd.DataFrame(lines)

```

</details>

In [None]:
# Watch me as I demonstrate that

## 7. Adapting our code to consume ALL files

<div style="font-size:0.85em;width:70%;display:block;margin-left:1em;">

🧩 **Here's a puzzle:** we want to do that **for each one of the 38** individual `bash_history.csv` files we have and the **username should not be 'Jon' for everyone!**

</div>

<details style="font-size:0.85em;margin-bottom: 1em;border:1px solid #aaa !important;border-radius: 4px !important;padding: .5em .5em 0.5em;"><summary>Click here to see a <strong>VERY VERY BAD WAY</strong> to achieve that:</summary>

The code below works but it hurts my eyes deeply! Do you get why this is a bad way to write code?

```python

# A variable to store bash history of all students
all_lines = []

with open("./data/bash_history_0db536d3.csv") as file:
    user_bash_history = file.read().split('\n')

for line in user_bash_history:
    current_line = line.split(',')
    # Adds the username
    current_line.append('0db536d3')
    all_lines.append(current_line)


with open("./data/bash_history_99605b72.csv") as file:
    user_bash_history = file.read().split('\n')

for line in user_bash_history:
    current_line = line.split(',')
    # Adds the username
    current_line.append('99605b72')
    all_lines.append(current_line)

with open("./data/bash_history_13220409.csv") as file:
    user_bash_history = file.read().split('\n')

for line in user_bash_history:
    current_line = line.split(',')
    # Adds the username
    current_line.append('13220409')
    all_lines.append(current_line)

with open("./data/bash_history_9fe1f569.csv") as file:
    user_bash_history = file.read().split('\n')

for line in user_bash_history:
    current_line = line.split(',')
    # Adds the username
    current_line.append('9fe1f569')
    all_lines.append(current_line)

# TODO: Repeat the same block for every remaining file

```

</details>

I will create the good solution live during the lecture while teaching you about **functions** and the `os` module in Python.
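As a taste of both ideas, here is a small sketch of a function that uses the `os` module to pull the anonymised username out of a file path. The function name `username_from_path` is made up for this example, and it assumes every file follows the `bash_history_<username>.csv` naming pattern.

```python
import os

def username_from_path(path):
    """Extract the username from a path like './data/bash_history_0db536d3.csv'."""
    base = os.path.basename(path)          # drops the folders: 'bash_history_0db536d3.csv'
    stem, _extension = os.path.splitext(base)  # separates 'bash_history_0db536d3' from '.csv'
    return stem.split('_')[-1]             # keeps the part after the last underscore

# Write the logic once, then reuse it for any file
for path in ['./data/bash_history_0db536d3.csv', './data/bash_history_99605b72.csv']:
    print(path, '->', username_from_path(path))
```

Defining the logic once in a function means a fix or improvement only has to be made in one place, instead of once per copy-pasted block.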

<details style="background-color:white;color:white;height:1px"><summary>Click here to see a better solution</summary>

```python
def read_bash_history(filename):
    with open(filename) as file:
        user_bash_history = file.read().split('\n')

    current_file_lines = []
    for line in user_bash_history:
        current_line = line.split(',')

        if len(current_line) == 1:
            continue

        current_dict = {
            'line_number'   :  current_line[0].strip(),
            'timestamp'     :  current_line[1].strip(),
            'command'       :  ','.join(current_line[2:]).strip(),  # restore any commas that were part of the command
            'username'      :  filename.split('_')[-1].split('.')[0]
        }
        
        current_file_lines.append(current_dict)

    return current_file_lines

filenames = [os.path.join('./data/', csv_filename) 
             for csv_filename in os.listdir('./data')]

all_lines = []

for filename in filenames:
    all_lines.extend(read_bash_history(filename))
```

</details>

In [None]:
pd.DataFrame(all_lines)

In [None]:
print(f"In the end we should have gathered: {len(all_lines)} lines")

In [None]:
# Sample the first line
all_lines[0]

## 8. Exploratory Data Analysis

(If time allows) We will explore the most frequently used commands.

In [None]:
df = pd.DataFrame(all_lines)

**Q:** On average, how many Terminal commands did each student use?

<details style="background-color:white;color:white;height:1px"><summary>Hidden solution</summary>

```python
num_commands = df['username'].value_counts().tolist()

mean_commands = sum(num_commands) / len(num_commands)
std_commands = (sum((x - mean_commands) ** 2 for x in num_commands) / len(num_commands)) ** 0.5

print(f"Mean number of commands: {mean_commands:.2f} (+/- {std_commands:.2f})")

```

</details>
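The same question can also be approached without pandas. The sketch below uses `collections.Counter` from the standard library on a tiny, made-up `all_lines` list (the usernames and commands are invented; the real list comes from the files in `./data/`).

```python
from collections import Counter

# Made-up records in the same shape as the list of dictionaries built earlier
all_lines = [
    {'command': 'ls', 'username': 'student_a'},
    {'command': 'ls', 'username': 'student_a'},
    {'command': 'ls', 'username': 'student_b'},
    {'command': 'pwd', 'username': 'student_b'},
    {'command': 'cd data', 'username': 'student_c'},
]

# Count how many commands each student ran
commands_per_student = Counter(row['username'] for row in all_lines)

counts = list(commands_per_student.values())
mean_commands = sum(counts) / len(counts)

print(commands_per_student)
print(f"Mean number of commands: {mean_commands:.2f}")
```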

**Q:** Which commands were most popular?

<details style="background-color:white;color:white;height:1px"><summary>Hidden solution</summary>

```python
df.groupby(['command'])['username'].nunique().sort_values(ascending=False).head(10).reset_index(name="Unique students")

```

</details>
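For comparison, here is a pure-Python sketch of the same ranking: for each command, collect the set of students who used it, then sort by the size of that set. As with the pandas version, a student who runs `ls` ten times still counts once. The `all_lines` data below is invented for illustration.

```python
# Made-up records in the same shape as the list of dictionaries built earlier
all_lines = [
    {'command': 'ls', 'username': 'student_a'},
    {'command': 'ls', 'username': 'student_a'},
    {'command': 'ls', 'username': 'student_b'},
    {'command': 'pwd', 'username': 'student_b'},
    {'command': 'cd data', 'username': 'student_c'},
]

# Map each command to the set of students who used it (sets ignore duplicates)
students_per_command = {}
for row in all_lines:
    students_per_command.setdefault(row['command'], set()).add(row['username'])

# Sort commands by number of unique students, largest first
ranking = sorted(
    ((command, len(users)) for command, users in students_per_command.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for command, unique_students in ranking:
    print(command, unique_students)
```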