### MY470 Computer Programming

### Final Assignment, AT 2024

#### \*\*\* Due 12:00 noon on Monday, January 20, 2025 \*\*\*

---
## The homophily and social contagion of cheating

The final assignment asks you to use the computational thinking and programming skills you learned in the course to answer an empirical social science question. You are expected to apply the best practices and theoretical concepts we covered in the course to produce a program that not only returns the correct output but is also legible, modular, and reasonably optimized. The assignment assumes mastery of loops, conditionals, and functions, as well as awareness of issues related to runtime performance.

In the assignment, we will study the homophily and social contagion of cheating in the massive multiplayer online game PlayerUnknown's Battlegrounds (PUBG). Cheating in this context means the adoption of unapproved software that gives the player an unfair advantage in the game (e.g. being able to see through walls). 

Our hypotheses are that cheaters tend to associate with other cheaters and that players who interact with cheaters become more likely to adopt cheating themselves. To provide preliminary evidence for these hypotheses, we will:

1. Observe whether cheaters tend to team up with other cheaters more than chance.
2. Observe whether players who observe cheaters are likely to become cheaters more than chance.
3. Observe whether players who are killed by cheaters are likely to become cheaters more than chance.

To test the "more than chance" part, we will simulate alternative universes in which the players played the same game but joined a different team or happened to be killed by someone else at a different time. We will then compare what we observe in the actual data with what we would expect in a "randomized" world.  

**NOTE: You are only allowed to use fundamental Python data types (lists, tuples, dictionaries, etc.) and basic modules (datetime, pickle, deepcopy, itertools, os, etc.) to complete this assignment.** You are not allowed to use advanced data querying and data analysis packages such as pandas, sqlite, networkx, numpy, or similar. We impose this restriction in order to test your grasp of fundamental programming concepts, not your scripting experience with Python libraries from before or from other courses you may be taking. 

**NOTE: You can use Copilot and similar generative-AI tools. However, you are expected to work on your own.** Hence, you should not discuss or obtain solutions from others (whether current or past students, family members, neighbors, paid consultants, essay farm workers, volunteers on StackOverflow, etc.). 

#### Hints

Although this assignment is quite streamlined, imagine that the tasks here are part of a larger project. How would you structure your program if, in the future, you needed to use a different dataset with a similar structure, manipulate the data differently, add analyses, or change the focus of the current analysis?  

Keep different data manipulations in separate functions/methods and group related functions/classes in separate `.py` files. Name your modules in an informative way based on their functionality. Do not use names that refer to the problem set outline here.

### Data

You will find the data in the [data](https://github.com/lse-my470/data) repository for the course. Whenever you refer to the data in your code, please use a relative path such as `'../data/filename.txt'` so that we can test your submission with our own copy of the data without having to modify your code.

The data were collected by Jinny Kim (LSE MSc ASDS '19). The repository contains the following files:

* `cheaters.txt` – contains cheaters who played between March 1 and March 10, 2019
    1. player account id
    2. estimated date when the player started cheating
    3. date when the player's account was banned due to cheating


* `kills.txt` – contains the killings done in 6,000 randomly selected matches played between March 1 and March 10, 2019
    1. match id 
    2. account id of the killer
    3. account id of the player who got killed
    4. time when the kill happened
 
 
* `team_ids.txt` – contains the team ids for players in 5,419 team-play matches in the same period. If a match from the `kills.txt` file does not appear in these data, we will assume that it was in single-player mode.  
    1. match id 
    2. player account id
    3. team id in match
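
As a minimal sketch of how such files could be read with fundamental data types only, the helper below yields one tuple of strings per line. The function name `read_rows` is hypothetical, and the tab delimiter is an assumption: check the actual files and adjust the `delimiter` argument if they are separated differently.

```python
def read_rows(path, delimiter="\t"):
    """Yield each non-empty line of a delimited text file as a tuple of strings.

    ASSUMPTION: fields are tab-separated; pass another delimiter if not.
    """
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                yield tuple(line.split(delimiter))
```

Because the function is a generator, the file is read lazily, for example `for match_id, player_id, team_id in read_rows('../data/team_ids.txt'): ...`.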

### Output

The tasks ask you to output actual counts and expected counts (mean with 95% confidence interval). To estimate the 95% confidence intervals, ignore the small sample size and the fact that we are dealing with count data, and simply use the approximation: 95% CI $= \mu \pm 1.96 \frac{\sigma}{\sqrt{n}}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the counts in the $n=20$ randomizations. You are free to use `statsmodels` or `numpy` to calculate these values.
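
The approximation can also be computed with the standard library alone. The sketch below is one possible helper, not a required design: the name `mean_with_ci` is hypothetical, the dictionary keys simply mirror those printed in the result cells of this notebook, and the use of the sample standard deviation (`statistics.stdev`, dividing by $n-1$) rather than the population version (`statistics.pstdev`) is an assumption.

```python
import math
import statistics


def mean_with_ci(counts, z=1.96):
    """Return the mean, standard deviation, and mean +/- z * sd / sqrt(n)."""
    n = len(counts)
    mean = statistics.mean(counts)
    # ASSUMPTION: sample standard deviation; swap in statistics.pstdev if preferred.
    std_dev = statistics.stdev(counts) if n > 1 else 0.0
    half_width = z * std_dev / math.sqrt(n)
    return {
        "mean": mean,
        "std_dev": std_dev,
        "ci_lower": mean - half_width,
        "ci_upper": mean + half_width,
    }
```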


#### Hints

When writing your code, test it on a small "toy dataset", instead of the entire data. This way, you won't need to wait for minutes/hours just to find out that you have a syntax error!

If the randomization is time consuming, it may be worth finding a way to save the data you generate on hard disk so that you don't need to run the randomization again and again. If you decide to do so, please write your code to save any such files with processed data in the directory where this file resides. Do NOT save any data in `../data/`!

If you need to save any new data, think carefully about the most efficient way, both in terms of time and space, to save them. Your code will be evaluated on both time and space efficiency.
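
One way to avoid re-running an expensive step is a small cache built on `pickle`. This is only a sketch: `load_or_compute` and its arguments are hypothetical names, and the cache file is expected to live next to this notebook, never in `../data/`.

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Load a pickled result if it exists; otherwise compute it and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(cache_path, "wb") as handle:
        # The highest protocol is a compact binary format.
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result
```

Passing the computation as a function (for example a `lambda` wrapping the randomization call) means it only runs when no cached file is found.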

## Import and run your code here

Keep your code in separate `.py` files and then import functions, etc. in the code cell below. In the subsequent cells, call the functions/methods you need to conduct the requested analyses. We should be able to run all cells here to calculate again the results and get the requested output, without having to modify your code in any way. 

In [1]:
import output
import simulations

In [2]:
# File paths
FPATH_teams = '../data/team_ids.txt'
FPATH_cheaters = '../data/cheaters.txt'
FPATH_kills = '../data/kills.txt'

# Load data
teams_data, cheaters_data, kills_data = output.prepare_data(FPATH_teams, FPATH_cheaters, FPATH_kills)


### 1. Do cheaters team up?

Use the files `cheaters.txt` and `team_ids.txt` to estimate how often cheaters (regardless of when exactly they started cheating) end up on the same team. Your output should say how many teams have 0, 1, 2, 3, or 4 cheaters.
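
The counting step could be sketched as follows, assuming the team data are available as `(match_id, player_id, team_id)` tuples and the cheater ids as a set. Both the function name and these input shapes are illustrative assumptions rather than a prescribed interface.

```python
from collections import Counter


def count_cheaters_per_team(team_rows, cheater_ids):
    """Return {number of cheaters on a team: number of such teams}.

    team_rows: iterable of (match_id, player_id, team_id) tuples (assumed shape).
    cheater_ids: set of account ids, so membership tests take constant time.
    """
    cheaters_in_team = Counter()
    for match_id, player_id, team_id in team_rows:
        # A team is identified by match AND team id; teams with no cheaters
        # still get an entry with value 0.
        cheaters_in_team[(match_id, team_id)] += int(player_id in cheater_ids)
    return Counter(cheaters_in_team.values())
```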

Now, randomly shuffle the team ids among the players in a match. Repeat this 20 times and estimate the expected counts as before. Output the mean and the 95% confidence intervals for the expected counts. 
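
A minimal sketch of one such randomization is shown below, again assuming `(match_id, player_id, team_id)` tuples as input. The name `shuffle_team_ids` and the optional `rng` argument are hypothetical. Permuting the existing team ids within a match keeps the number and sizes of teams unchanged.

```python
import random


def shuffle_team_ids(team_rows, rng=random):
    """Return new rows with team ids permuted among the players of each match."""
    by_match = {}
    for match_id, player_id, team_id in team_rows:
        players, teams = by_match.setdefault(match_id, ([], []))
        players.append(player_id)
        teams.append(team_id)

    shuffled_rows = []
    for match_id, (players, teams) in by_match.items():
        rng.shuffle(teams)  # in-place permutation of this match's team ids
        shuffled_rows.extend(
            (match_id, player, team) for player, team in zip(players, teams)
        )
    return shuffled_rows
```

Passing a seeded generator such as `random.Random(42)` as `rng` makes a run reproducible while testing on a toy dataset.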

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

In [3]:
# Answer 1
real_cheater_counts, ci_sim_cheater_counts = output.answer_1(teams_data, cheaters_data)

print("1.1 Cheaters per team:")
for category, count in real_cheater_counts.items():
    print(f"  {category} cheater(s): {count}")

print("1.2 Simulation Results:")
for category, stats in ci_sim_cheater_counts.items():
    print(f"  {category} cheater(s):")
    print(f"    Mean: {stats['mean']}")
    print(f"    Standard Deviation: {stats['std_dev']}")
    print(f"    95% Confidence Interval: [{stats['ci_lower']}, {stats['ci_upper']}]")

1.1 Cheaters per team:
  0 cheater(s): 170782
  1 cheater(s): 3199
  2 cheater(s): 182
  3 cheater(s): 9
  4 cheater(s): 2
1.2 Simulation Results:
  0 cheater(s):
    Mean: 170611.0
    Standard Deviation: 4.7
    95% Confidence Interval: [170608.9, 170613.1]
  1 cheater(s):
    Mean: 3528.2
    Standard Deviation: 9.4
    95% Confidence Interval: [3524.1, 3532.3]
  2 cheater(s):
    Mean: 34.5
    Standard Deviation: 4.7
    95% Confidence Interval: [32.4, 36.6]
  3 cheater(s):
    Mean: 0.2
    Standard Deviation: 0.4
    95% Confidence Interval: [0.0, 0.4]
  4 cheater(s):
    Mean: 0.0
    Standard Deviation: 0.0
    95% Confidence Interval: [0.0, 0.0]


### 2. Do victims of cheating start cheating?

Use the files `cheaters.txt` and `kills.txt` to count how many players got killed by an active cheater on at least one occasion and then started cheating. Specifically, we are interested in situations where:

1. Player B has started cheating but player A is not cheating.
2. Player B kills player A.
3. At some point afterwards, player A starts cheating.

Output the count in the data. 

Then, simulate alternative worlds in which everything is the same but the events took place in a somewhat different sequence. To do so, randomize within a game, keeping the timing and structure of interactions but shuffling the player ids. Generate 20 such randomizations and estimate, as before, the expected count of victims of cheating who start cheating. Output the mean and the 95% confidence interval for the expected count in these randomized worlds.
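
One way to read "keeping the timing and structure of interactions but shuffling the player ids" is to draw a random one-to-one relabelling of the players in a match and apply it to every kill. The sketch below does this for a single match. The name `randomize_match` and the `(killer_id, victim_id, time)` tuple shape are assumptions for illustration.

```python
import random


def randomize_match(kills, rng=random):
    """Relabel the players of ONE match with a random permutation of their ids.

    kills: list of (killer_id, victim_id, time) tuples (assumed shape).
    Kill order and times are untouched; only who occupies each role changes.
    """
    players = sorted({p for killer, victim, _ in kills for p in (killer, victim)})
    shuffled = players[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(players, shuffled))  # original id -> substituted id
    return [(mapping[killer], mapping[victim], time) for killer, victim, time in kills]
```

Because the same mapping is applied to killers and victims, a player who made three kills and then died is replaced by exactly one other player with the same history.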

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

#### Hint

The starting time of cheating is estimated as a date, so assume that a player cheats in any match that they started playing on that date or afterwards. Use the first kill in a match to determine the match starting date. Thus, if the first kill in the match occurred before midnight of the cheating date but the match ended after midnight, we will assume that the player was not cheating just yet. 
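
The date rule in this hint, combined with the three conditions above, could be sketched as follows. All names are hypothetical, and the input shapes are assumptions: `cheat_start` maps an account id to a `datetime.date`, and `kills_by_match` maps a match id to a list of `(killer_id, victim_id, datetime)` tuples.

```python
from datetime import date, datetime


def is_cheating(player_id, match_start, cheat_start):
    """True if the player's cheating start date is on or before the match start date."""
    start = cheat_start.get(player_id)
    return start is not None and start <= match_start.date()


def victims_who_later_cheat(kills_by_match, cheat_start):
    """Return the set of players killed by an active cheater before they cheat themselves."""
    victims = set()
    for kills in kills_by_match.values():
        match_start = min(time for _, _, time in kills)  # first kill dates the match
        for killer, victim, _ in kills:
            if (is_cheating(killer, match_start, cheat_start)
                    and victim in cheat_start  # the victim cheats at some point...
                    and not is_cheating(victim, match_start, cheat_start)):  # ...but not yet
                victims.add(victim)
    return victims


# Toy data: "b" cheats from March 2, "a" from March 5; the match is on March 3.
cheat_start = {"b": date(2019, 3, 2), "a": date(2019, 3, 5)}
kills_by_match = {
    "m1": [("b", "a", datetime(2019, 3, 3, 10, 0)),
           ("b", "c", datetime(2019, 3, 3, 10, 5))],
}
toy_victims = victims_who_later_cheat(kills_by_match, cheat_start)
```

Collecting ids in a set counts each player once, however many times they were killed by a cheater.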


In [4]:
# Precompute randomized kills (shared randomization between Answer 2 and Answer 3)
randomized_kills = simulations.randomize_player_ids(kills_data, num_randomizations=20)

# Answer 2
converted_from_killed, ci_sim_converted_from_killed = output.answer_2(kills_data, cheaters_data, randomized_kills)

print(f"2.1 Count of IDs killed by a cheater that became cheaters: {converted_from_killed}")
print(
    "2.2 Simulation Results:\n"
    f"  Mean: {ci_sim_converted_from_killed['mean']}\n"
    f"  Standard Deviation: {ci_sim_converted_from_killed['std_dev']}\n"
    f"  95% Confidence Interval: [{ci_sim_converted_from_killed['ci_lower']}, {ci_sim_converted_from_killed['ci_upper']}]"
)


2.1 Count of IDs killed by a cheater that became cheaters: 47
2.2 Simulation Results:
  Mean: 11.2
  Standard Deviation: 3.1
  95% Confidence Interval: [9.8, 12.6]


### 3. Do observers of cheating start cheating?

Use the files `cheaters.txt` and `kills.txt` to count how many players observed an active cheater on at least one occasion and then started cheating. Cheating players can be recognized because they exhibit abnormal killing patterns. We will assume that player A realizes that player B cheats if:

1. Player B has started cheating but player A is not cheating.
2. Player B kills at least 3 other players before player A gets killed in the game.
3. At some point afterwards, player A starts cheating.

Output the count in the data.
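
The observation rule can be sketched as a single pass over the kills of each match in time order. The code below rests on several labelled assumptions: the names are hypothetical, kills are `(killer_id, victim_id, datetime)` tuples grouped by match, only players who are themselves killed in the match can be observers, and a player counts only if they die strictly after some active cheater's third kill (so the third victim is not an observer). The date rule follows the hint in Part 2.

```python
from datetime import date, datetime


def is_cheating(player_id, match_start, cheat_start):
    """Date rule from the Part 2 hint: start date on or before the match start date."""
    start = cheat_start.get(player_id)
    return start is not None and start <= match_start.date()


def observers_who_later_cheat(kills_by_match, cheat_start, min_kills=3):
    """Return players killed after an active cheater's third kill who cheat later."""
    observers = set()
    for kills in kills_by_match.values():
        ordered = sorted(kills, key=lambda kill: kill[2])
        match_start = ordered[0][2]
        kills_so_far = {}        # active cheater -> kills made so far in this match
        cheater_exposed = False  # True once any active cheater has min_kills kills
        for killer, victim, _ in ordered:
            # Check the victim BEFORE counting this kill, so the victim of the
            # third kill has seen only two earlier kills and is not an observer.
            if (cheater_exposed
                    and victim in cheat_start
                    and not is_cheating(victim, match_start, cheat_start)):
                observers.add(victim)
            if is_cheating(killer, match_start, cheat_start):
                kills_so_far[killer] = kills_so_far.get(killer, 0) + 1
                if kills_so_far[killer] >= min_kills:
                    cheater_exposed = True
    return observers


# Toy data: "b" already cheats; "v3" and "x" start cheating days after the match.
cheat_start = {"b": date(2019, 3, 1), "v3": date(2019, 3, 9), "x": date(2019, 3, 8)}
kills_by_match = {
    "m1": [("b", victim, datetime(2019, 3, 3, 10, minute))
           for minute, victim in enumerate(["v1", "v2", "v3", "x", "y"])],
}
toy_observers = observers_who_later_cheat(kills_by_match, cheat_start)
```

Sorting once per match and keeping running counts avoids rescanning earlier kills for every victim.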

Then, use the 20 randomizations from Part 2 to estimate the expected count of observers of cheating who start cheating. Output the mean and the 95% confidence interval for the expected count in these randomized worlds.

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

In [5]:
# Answer 3
converted_from_observed, ci_sim_converted_from_observed = output.answer_3(kills_data, cheaters_data, randomized_kills)

print(f"3.1 Count of IDs who observed a cheater that became cheaters: {converted_from_observed}")
print(
    "3.2 Simulation Results:\n"
    f"  Mean: {ci_sim_converted_from_observed['mean']}\n"
    f"  Standard Deviation: {ci_sim_converted_from_observed['std_dev']}\n"
    f"  95% Confidence Interval: [{ci_sim_converted_from_observed['ci_lower']}, {ci_sim_converted_from_observed['ci_upper']}]"
)

3.1 Count of IDs who observed a cheater that became cheaters: 203
3.2 Simulation Results:
  Mean: 46.6
  Standard Deviation: 6.6
  95% Confidence Interval: [43.7, 49.5]


---

### Evaluation

| Aspect         | Mark     | Comment   
|:--------------:|:--------:|:----------------------
| Code runs and attempts tasks     |   /20    |              
| Output 1       |   /10    | 
| Output 2       |   /10    | 
| Output 3       |   /10    | 
| Legibility     |   /10    | 
| Modularity     |   /10    | 
| Optimization   |   /30    | 
| **Total**      |**/100**  | 
