# Computational Methods HT2025 - Week 1. Getting Started.

Before this lab we expect you to have: 
- A GitHub account
- A working personal computer with Python installed. 
	- For Mac, a system Python may already be present; check by running `python3 --version` in a terminal.
	- For Windows, ...
- Downloaded and installed "Visual Studio Code"

**Data**: AI Stack Exchange, already collected (placed in `data/AI_Stack_Posts.feather`)

**Claim**: Distributions differ dramatically: some are approximately normal, others are heavily skewed.

**Representation**: Simple plots and descriptives


# Exercise 1. Getting the data in 

The first exercise is simply getting the data into Python. The data is stored in a file format called "feather". To read it, you will likely need to install an additional package.

In VS Code, click Select Kernel, choose the venv, then open a terminal and type:

~~~bash
pip install pyarrow
~~~

You will also need to install pandas, matplotlib, and scipy:

~~~bash
pip install pandas matplotlib scipy
~~~
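If you want to confirm that the packages are visible to the kernel you selected, a minimal sketch like the one below may help. It only checks whether each package can be found; it does not import or test them. The list of names simply mirrors the installs above.

```python
from importlib.util import find_spec

# Packages this lab relies on; pyarrow is what lets pandas read feather files.
required = ["pandas", "pyarrow", "matplotlib", "scipy"]

# find_spec returns None when a top-level package cannot be found.
status = {name: find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'ok' if installed else 'MISSING'}")
```

If anything is reported as missing, the most likely cause is that `pip` installed into a different environment than the one the notebook kernel uses.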

Then you should be able to run the code below: 

In [None]:
import pandas as pd 
from IPython.display import display, Markdown, HTML
import matplotlib.pyplot as plt
from scipy import stats

stack_df = pd.read_feather('data/AI_Stack_Posts.feather')

# info() prints its summary directly and returns None, so no display() is needed
stack_df.info()
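Once a DataFrame is loaded, a few quick calls give a first impression of its contents. The sketch below uses a small made-up DataFrame (the column names are hypothetical and not those of the lab data) so that it runs without the feather file.

```python
import pandas as pd

# Toy stand-in for the real data; column names are hypothetical.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Kind": ["question", "answer", "answer"],
    "Score": [5, 0, 2],
})

print(toy_df.shape)    # (rows, columns)
print(toy_df.dtypes)   # data type of each column
print(toy_df.head(2))  # first two rows

# info() prints its summary directly and returns None.
result = toy_df.info()
```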

# Exercise 2. Descriptive data 

In this exercise, we will do some counting. Note that the structure of this file is very similar to the one described at: https://meta.stackexchange.com/questions/2677/

We can see that there are a couple of different kinds of posts. First, count the number of posts and the number of replies. There are a few ways to do this, but I recommend a simple `value_counts()`:

In [None]:
stack_df[<...>].value_counts()
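As a reminder of what `value_counts()` returns, here is a small sketch on made-up data. The Series and its values are invented for illustration and are unrelated to the lab file.

```python
import pandas as pd

# Hypothetical category column, not taken from the lab data.
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colours.value_counts()                # most frequent value first
shares = colours.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print(shares.round(2))
```

The result is itself a Series indexed by the distinct values, so an individual count can be looked up with, for example, `counts["red"]`.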

In [None]:
# Create a data set just of the posts: 
# Answer should say at the top: 
# Index: 12380 entries, 0 to 26763

posts_df = stack_df[stack_df["PostTypeId"] == <...>].copy()

# info() prints its summary directly and returns None, so no print() is needed
posts_df.info()
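The filtering pattern used here is a boolean mask followed by `.copy()`. The sketch below shows the same pattern on a toy DataFrame with hypothetical column names and values.

```python
import pandas as pd

# Toy data; the column names and codes are hypothetical.
toy_df = pd.DataFrame({"Kind": [1, 2, 2, 1], "Score": [4, 0, 7, 1]})

mask = toy_df["Kind"] == 2    # boolean Series, one entry per row
subset = toy_df[mask].copy()  # independent copy of the matching rows

# Adding a column to the copy leaves the original DataFrame untouched.
subset["Doubled"] = subset["Score"] * 2

print(subset.index.tolist())
print("Doubled" in toy_df.columns)
```

Filtered rows keep their original index labels, which is why a filtered DataFrame can report fewer entries than its largest index label would suggest. Depending on the pandas version, skipping `.copy()` can trigger a `SettingWithCopyWarning` when new columns are added later.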

In [None]:
# Check the text of the first post: 
# Answer should display: 
# "What does "backprop" mean? Is the "backprop" term basically the same as "backpropagation" or does it have a different meaning?"

display(HTML(posts_df["Body"].<...>[0]))

# Exercise 3. Cleaning some data 

Let's count the number of characters, and then the number of words, for each entry in a column.

We should use "BodyText" rather than "Body", as it has been stripped of HTML.

In [None]:
# To start let's count the characters for a single entry: 

print(len(posts_df["Body"].<...>[0]))
print(len(posts_df["BodyText"].<...>[0]))

In [None]:
# Now let's see how to do that for an entire column
# Answer: 127 

posts_df["BodyTextLen"] = posts_df[<...>].map(lambda x: len(x))
print(posts_df["BodyTextLen"].<...>[0])
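For comparison, pandas also offers a string accessor that gives the same character counts as mapping `len`. This is a sketch on made-up strings, not the lab data.

```python
import pandas as pd

# Made-up strings for illustration.
texts = pd.Series(["Hello world", "", "A longer example sentence"])

via_map = texts.map(len)   # equivalent to .map(lambda x: len(x))
via_str = texts.str.len()  # vectorised string accessor

print(via_map.tolist())
print(via_str.tolist())
```

The two approaches differ mainly on missing values: `len` raises a `TypeError` when it meets `None` or `NaN`, whereas `.str.len()` passes the missing value through.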


In [None]:
# Now it's your turn: Instead of counting the characters let's count the words

posts_df["BodyWordLen"] = posts_df[<...>].map(lambda x: len(<...>))
print(posts_df["BodyWordLen"].<...>[0])


# Exercise 4. Plotting a distribution

In [None]:
plt.hist(posts_df["BodyTextLen"])
plt.show()

In [None]:
# Now plot the character counts again, but this time twice:
# once on linear axes and once on a "log-log" plot

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.hist(<...>)
ax1.set_title("<...>")

ax2.hist(<...>)
ax2.set_title("<...>")
ax2.set_xscale("log")
ax2.set_yscale("log")

plt.tight_layout()
plt.show()
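One thing to be aware of with log-scaled axes: the default bins are equally wide on a linear scale, so they look increasingly squeezed on a log axis. A possible workaround, sketched below on synthetic data, is to space the bin edges evenly in log terms. This assumes all values are strictly positive; the data, seed, and bin count are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic right-skewed data standing in for a length column.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=5, sigma=1, size=1000)

# 21 edges (20 bins) evenly spaced in log10, spanning roughly min to max.
log_bins = np.logspace(np.log10(values.min()), np.log10(values.max()), 21)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=log_bins)
ax.set_xscale("log")
ax.set_title("Log-spaced bins (synthetic data)")
plt.show()
```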

# Exercise 5. Descriptive statistics

For the following, I am not providing starter code (except the one snippet below). Instead, you should be able to piece together the way to do this from the above code. 

5a. Split `posts_df` into two separate DataFrames. DataFrame 1 will have posts which have a URL (i.e. `[posts_df["<...>"] > 0]`). DataFrame 2 will have posts which do not (`[posts_df["<...>"] == 0]`).

5b. Report on the mean, median, and standard deviation of the "Score" column for both. 

5c. How can we compare whether one is 'significantly higher' than the other? Hint: A t-test is one option, but it is a parametric test. Should we use a parametric or a non-parametric test here?

5d. Plot the histogram of the two series side by side.

In [None]:
# Example code: 
s1 = pd.Series([1,2,3,4,5,6,7,8])
s2 = pd.Series([5,2,-5,4,0,1,7,5])

# Parametric: independent samples t-test
t_stat, p_value = stats.ttest_ind(s1, s2)
print(t_stat, p_value)

# Non-parametric: Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(s1, s2)
print(u_stat, p_value)
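As a side note on the example above, both SciPy functions return a result object with `.statistic` and `.pvalue` attributes, and both accept options worth knowing about. The sketch below reuses the same two toy Series; the 0.05 threshold is a conventional choice made here for illustration, not a requirement of the lab.

```python
import pandas as pd
from scipy import stats

s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
s2 = pd.Series([5, 2, -5, 4, 0, 1, 7, 5])

# Welch's t-test: does not assume the two groups share a variance.
welch = stats.ttest_ind(s1, s2, equal_var=False)

# Mann-Whitney U with the alternative hypothesis stated explicitly.
mwu = stats.mannwhitneyu(s1, s2, alternative="two-sided")

alpha = 0.05  # conventional threshold, chosen here for illustration
for name, res in [("Welch t-test", welch), ("Mann-Whitney U", mwu)]:
    verdict = "reject" if res.pvalue < alpha else "fail to reject"
    print(f"{name}: statistic={res.statistic:.3f}, p={res.pvalue:.3f} -> {verdict} H0")
```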

In [None]:
# Your code below here: 

# Exercise 6. Choose a different variable and repeat

Instead of using "Score", explore the data set and select or create a variable. Then compare that variable between two groups. They can be different groups than "do or don't have a URL". Posit a reason why those groups would be significantly different. Report:
- The mean of your variable for each group
- The median of your variable for each group
- A plot of the variable for each of the groups separately
- A significance test (parametric or non-parametric)
- A short interpretation of your findings

In [None]:
# Your code below here:

# AI Declaration 



The code was written primarily by myself. I used Claude for a few reminders here and there (like `plt.yscale` vs `ax1.set_yscale()`). 