---
## The homophily and social contagion of cheating

The final assignment asks you to use the computational thinking and programming skills you learned in the course to answer an empirical social science question. You are expected to apply the best practices and theoretical concepts we covered in the course to produce a program that not only returns the correct output but is also legible, modular, and reasonably optimized. The assignment assumes mastery of loops, conditionals, and functions, as well as awareness of issues related to runtime performance.

In this assignment, we will study the homophily and social contagion of cheating in the massively multiplayer online game PlayerUnknown's Battlegrounds (PUBG). Cheating in this context means the adoption of unapproved software that gives the player an unfair advantage in the game (e.g. being able to see through walls).

Our hypotheses are that cheaters tend to associate with other cheaters (homophily) and that players who interact with cheaters become more likely to adopt cheating themselves (social contagion). To provide preliminary evidence for these hypotheses, we will:

1. Observe whether cheaters tend to team up with other cheaters more than chance.
2. Observe whether players who observe cheaters are likely to become cheaters more than chance.
3. Observe whether players who are killed by cheaters are likely to become cheaters more than chance.

To test the "more than chance" part, we will simulate alternative universes in which the players played the same game but joined a different team or happened to be killed by someone else at a different time. We will then compare what we observe in the actual data with what we would expect in a "randomized" world.

**NOTE: You are only allowed to use fundamental Python data types (lists, tuples, dictionaries, numpy.ndarray, etc.) to complete this assignment.** You are not allowed to use advanced data querying and data analysis packages such as pandas, sqlite, networkx, or similar. We impose this restriction in order to test your grasp of fundamental programming concepts, not your scripting experience with Python libraries from before or from other courses you may be taking. 

#### Hints

Although this assignment is quite streamlined, imagine that the tasks here are part of a larger project. How would you structure your program if in the future you may need to use a different dataset with similar structure, manipulate the data differently, add additional analyses, or modify the focus of the current analysis?  

Keep different data manipulations in separate functions/methods and group related functions/classes in separate `.py` files. Name your modules in an informative way.

### Data

The data were collected by Jinny Kim (LSE MSc ASDS '19). The repository contains the following files:

* `cheaters.txt` – contains cheaters who played between March 1 and March 10, 2019
    1. player account id
    2. estimated date when the player started cheating
    3. date when the player's account was banned due to cheating


* `kills.txt` – contains the killings done in 6,000 randomly selected matches played between March 1 and March 10, 2019
    1. match id 
    2. account id of the killer
    3. account id of the player who got killed
    4. time when the kill happened
 
 
* `team_ids.txt` – contains the team ids for players in 5,419 team-play matches in the same period. If a match from the `kills.txt` file does not appear in these data, we will assume that it was in single-player mode.
    1. match id 
    2. player account id
    3. team id in match
    
You should not modify the original data in any way. Similarly, you should not duplicate the data in this repository but instead use a relative path to access them.
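As a minimal sketch of read-only access through a relative path, the helper below loads a delimited text file into a list of tuples. The function name `read_rows` and the tab delimiter are assumptions for illustration; check the actual files before relying on either.

```python
import csv
from pathlib import Path


def read_rows(path, delimiter="\t"):
    """Read a delimited text file into a list of tuples without modifying it.

    The tab delimiter is an assumption; pass a different one if the
    files turn out to be separated differently.
    """
    # Opening in the default read mode guarantees the original file is untouched.
    with Path(path).open(newline="") as handle:
        return [tuple(row) for row in csv.reader(handle, delimiter=delimiter)]
```

A call would then pass a path relative to the notebook's location, so the data never need to be copied into the repository.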

### Output

The tasks ask you to output actual counts and expected counts (mean with 95% confidence interval). To estimate the 95% confidence intervals, ignore the small sample size and the fact that we are dealing with count data, and simply use the approximation: 95% CI $= \mu \pm 1.96 \frac{\sigma}{\sqrt{n}}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the counts in the $n=20$ randomizations. You are free to use `statsmodels` or `numpy` to calculate these values.
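The approximation can be sketched with `numpy` as follows. The function name `mean_and_ci` and the example counts are made up for illustration, and using the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator to use.

```python
import numpy as np


def mean_and_ci(counts, z=1.96):
    """Return (mean, lower, upper) using mean ± z * sd / sqrt(n)."""
    values = np.asarray(counts, dtype=float)
    n = values.size
    mean = values.mean()
    # ddof=1 gives the sample standard deviation; this choice is an assumption.
    sd = values.std(ddof=1) if n > 1 else 0.0
    half_width = z * sd / np.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Made-up counts from five hypothetical randomizations.
mean, lower, upper = mean_and_ci([12, 15, 11, 14, 13])
print(f"{mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Note that with small counts this interval can extend below zero, which is a consequence of the normal approximation rather than a bug.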


#### Hints

When writing your code, test it on a small "toy dataset", instead of the entire data. This way, you won't need to wait for minutes/hours just to find out that you have a syntax error!

If the randomization is time consuming, it may be worth finding a way to save the data you generate on hard disk so that you don't need to run the randomization again and again. If you decide to do so, please write your code to save any such files with processed data in the directory where this file resides. This way, we can run your code without having to alter it.

If you need to save any new data, think carefully about the most efficient way, both in terms of time and space, to save them.

## Import and run your code here

Keep your code in separate `.py` files and then import it in the code cell below. In the subsequent cells, call the functions/methods you need to conduct the requested analyses. We should be able to run all cells here to recalculate the results and get the requested output, without having to modify your code in any way.

In [1]:
# Import modules here
import numpy as np
import math
import csv
import random
from datetime import datetime
import all_questions
import q1
import q2
import q3

**Import Datasets**

In [2]:
#create data lists
cheaters_ls = all_questions.import_cheater_data('../assignment-final-data/cheaters.txt')
team_ls = all_questions.import_team_data('../assignment-final-data/team_ids.txt')
kills_ls = all_questions.import_kills_data('../assignment-final-data/kills.txt')

In [3]:
account_cheaters = {row[0] for row in cheaters_ls}

**Create Match Dictionary**

In [4]:
match_dic = all_questions.create_match_dic(kills_ls, account_cheaters)

## Question One

### 1. Do cheaters team up?

Use the files `cheaters.txt` and `team_ids.txt` to estimate how often cheaters (regardless of when exactly they started cheating) end up on the same team. Your output should say how many teams have 0, 1, 2, 3, or 4 cheaters.

Now, randomly shuffle the team ids among the players in a match. Repeat this 20 times and estimate the expected counts as before. Output the mean and the 95% confidence intervals for the expected counts. 
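One way to read this shuffle is as a permutation of the team-id column within each match, which keeps every team's size while changing who is on it. The sketch below works under that reading; the function names, the row layout `(match_id, player_id, team_id)`, and the toy data are assumptions for illustration, not the solution used in the cells that follow.

```python
import random


def shuffle_teams_within_matches(team_rows, rng=random):
    """Return new (match_id, player_id, team_id) rows with team ids permuted within each match."""
    by_match = {}
    for match_id, player_id, team_id in team_rows:
        players, teams = by_match.setdefault(match_id, ([], []))
        players.append(player_id)
        teams.append(team_id)

    shuffled = []
    for match_id, (players, teams) in by_match.items():
        teams = teams[:]  # copy so the input rows are never modified
        rng.shuffle(teams)
        shuffled.extend((match_id, p, t) for p, t in zip(players, teams))
    return shuffled


def count_teams_by_cheaters(team_rows, cheaters):
    """Map number of cheaters on a team -> number of teams with that many cheaters."""
    per_team = {}
    for match_id, player_id, team_id in team_rows:
        key = (match_id, team_id)
        per_team[key] = per_team.get(key, 0) + (player_id in cheaters)
    counts = {}
    for n_cheaters in per_team.values():
        counts[n_cheaters] = counts.get(n_cheaters, 0) + 1
    return counts


# Toy data (hypothetical): two matches, players "a" and "b" are cheaters.
toy_rows = [("m1", "a", 1), ("m1", "b", 1), ("m1", "c", 2),
            ("m1", "d", 2), ("m2", "e", 1), ("m2", "f", 2)]
toy_cheaters = {"a", "b"}
print(count_teams_by_cheaters(toy_rows, toy_cheaters))
print(count_teams_by_cheaters(
    shuffle_teams_within_matches(toy_rows, random.Random(0)), toy_cheaters))
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when debugging on a toy dataset.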

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

In [5]:
unique_match_teams = q1.create_unique_match_teams(team_ls)
counter_dic = {0: 0, 1:0, 2:0, 3:0, 4:0}

In [6]:
team_match_dic = q1.populate_dict(team_ls, unique_match_teams, account_cheaters)
match_cheater_team_dic = q1.populate_cheater_dic(team_ls, unique_match_teams, account_cheaters)
q1_counter_dic = q1.q1_count(counter_dic, team_match_dic)
q1.q1_output(q1_counter_dic)

170782 amount of teams have 0 cheaters on the team.
3199 amount of teams have 1 cheaters on the team.
182 amount of teams have 2 cheaters on the team.
9 amount of teams have 3 cheaters on the team.
2 amount of teams have 4 cheaters on the team.


**SHUFFLE**

In [7]:
shuffle_counter_dic = {0: 0, 1:0, 2:0, 3:0, 4:0}

mean_ls, minus_ls, plus_ls = q1.repeat_n(shuffle_counter_dic, match_cheater_team_dic, unique_match_teams, team_ls, 20)

In [8]:
q1.q1_shuffle_output(shuffle_counter_dic, mean_ls, minus_ls, plus_ls)

When players are randomized: 
170625.50 amount of teams have 0 cheaters on the team.
The 95% confidence interval for this stat is from 170613.96 to 170637.04
3499.45 amount of teams have 1 cheaters on the team.
The 95% confidence interval for this stat is from 3476.32 to 3522.58
48.60 amount of teams have 2 cheaters on the team.
The 95% confidence interval for this stat is from 36.78 to 60.42
0.45 amount of teams have 3 cheaters on the team.
The 95% confidence interval for this stat is from -0.71 to 1.61
0.00 amount of teams have 4 cheaters on the team.
The 95% confidence interval for this stat is from 0.00 to 0.00


### 2. Do victims of cheating start cheating?

Use the files `cheaters.txt` and `kills.txt` to count how many players got killed by an active cheater on at least one occasion and then started cheating. Specifically, we are interested in situations where:

1. Player B has started cheating but player A is not cheating.
2. Player B kills player A.
3. At some point afterwards, player A starts cheating.

Output the count in the data. 
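The three conditions can be sketched as a single date comparison per kill. The sketch below assumes each kill row already carries the date of the match as a `datetime.date`, and that `cheat_start` maps each cheater's account id to their estimated start date; both the row layout and the names are hypothetical.

```python
from datetime import date


def count_victims_who_start_cheating(kills, cheat_start):
    """Count distinct players killed by an active cheater who later start cheating.

    kills: iterable of (match_id, killer_id, victim_id, match_date)
    cheat_start: dict mapping account id -> date the player started cheating
    """
    victims = set()
    for _match_id, killer, victim, match_date in kills:
        killer_start = cheat_start.get(killer)
        victim_start = cheat_start.get(victim)
        if killer_start is None or victim_start is None:
            continue  # the killer never cheats, or the victim never becomes a cheater
        # Killer is already cheating; victim is not cheating yet but will later.
        if killer_start <= match_date < victim_start:
            victims.add(victim)
    return len(victims)


# Toy data (hypothetical): only player "A" satisfies all three conditions.
toy_kills = [
    ("m1", "B", "A", date(2019, 3, 2)),
    ("m1", "B", "C", date(2019, 3, 2)),
    ("m2", "D", "A", date(2019, 3, 3)),
    ("m3", "B", "E", date(2019, 3, 1)),
]
toy_start = {"B": date(2019, 3, 2), "A": date(2019, 3, 4),
             "D": date(2019, 3, 1), "E": date(2019, 3, 5)}
print(count_victims_who_start_cheating(toy_kills, toy_start))
```

Collecting victims in a set means a player killed by cheaters on several occasions is counted once.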

Then, simulate alternative worlds in which everything is the same but the events took place in a somewhat different sequence. To do so, randomize within a game, keeping the timing and structure of interactions but shuffling the player ids. Generate 20 randomizations like this and estimate the expected count of victims of cheating who start cheating as before. Output the mean and the 95% confidence interval for the expected count in these randomized worlds.
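One possible reading of this randomization is to relabel the players of a single match with a random one-to-one mapping, so that every kill keeps its time and its position in the kill graph while the identities move around. The sketch below follows that reading; the function name and the row layout `(killer_id, victim_id, time)` are assumptions, and other designs (for example, shuffling only among certain groups of players) are equally defensible.

```python
import random


def shuffle_players_in_match(kills, rng=random):
    """Relabel the players of ONE match with a random one-to-one mapping.

    kills: list of (killer_id, victim_id, time) for a single match.
    Returns a new list; times and the who-killed-whom structure are unchanged.
    """
    players = sorted({p for killer, victim, _ in kills for p in (killer, victim)})
    permuted = players[:]
    rng.shuffle(permuted)
    relabel = dict(zip(players, permuted))
    return [(relabel[killer], relabel[victim], time) for killer, victim, time in kills]


# Toy match (hypothetical ids and times).
toy_match = [("p1", "p2", 10), ("p1", "p3", 25), ("p4", "p1", 40)]
print(shuffle_players_in_match(toy_match, random.Random(0)))
```

Because the mapping is a bijection, a player who made two kills in the original match corresponds to exactly one player with two kills in the randomized match.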

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

#### Hint

The starting time of cheating is estimated as a date, so assume that a player cheats in any match that they started playing on that date or afterwards. Use the match starting date: if a match started before midnight of the cheating date but ended after midnight, assume that the player was not yet cheating in that match.
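The midnight rule amounts to comparing the calendar date on which the match started with the cheating start date. A minimal sketch, in which the function name and the example dates are made up and how the match start time is obtained from the data is left open:

```python
from datetime import date, datetime


def was_cheating(cheating_start, match_start):
    """True if the match began on or after the estimated cheating start date."""
    return match_start.date() >= cheating_start


cheating_start = date(2019, 3, 2)
late_match = datetime(2019, 3, 1, 23, 50)  # starts before midnight, may end after it
next_match = datetime(2019, 3, 2, 0, 5)    # starts on the cheating date

print(was_cheating(cheating_start, late_match))  # False
print(was_cheating(cheating_start, next_match))  # True
```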


In [9]:
# Create a dictionary to look up each cheater's start date

cheater_dic = {x[0]:x[1] for x in cheaters_ls}

In [10]:
q2_output = q2.q2_count(kills_ls, account_cheaters, cheater_dic, match_dic)

print("The number of players who start cheating after being killed by a cheater is", q2_output)

The number of players who start cheating after being killed by a cheater is 47


**SHUFFLE**

In [11]:
q2.q2_shuffle_output(match_dic, account_cheaters, cheater_dic, 20)

The mean number of players who start cheating after being killed by a cheater when players are randomized is:  14.1 The mean has a 95% confidence interval of 5.76 to 22.44


### 3. Do observers of cheating start cheating?

Use the files `cheaters.txt` and `kills.txt` to count how many players observed an active cheater on at least one occasion and then started cheating. Cheating players can be recognized because they exhibit abnormal killing patterns. We will assume that player A realizes that player B cheats if:

1. Player B has started cheating but player A is not cheating.
2. Player B kills at least 3 other players before player A gets killed in the game.
3. At some point afterwards, player A starts cheating.

Output the count in the data.
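The observation rule can be sketched per match by walking through the kills in time order and tracking how many kills each active cheater has made. The function name, the row layout `(killer_id, victim_id, time)`, and the assumption that `active_cheaters` already contains only players cheating in this match are all hypothetical; the caller would still need to keep only observers who start cheating afterwards.

```python
def observers_in_match(kills, active_cheaters):
    """Return non-cheating players killed after some active cheater made 3 kills.

    kills: list of (killer_id, victim_id, time) for a single match.
    active_cheaters: set of players already cheating in this match.
    """
    kill_counts = {}
    threshold_reached = False
    observers = set()
    for killer, victim, _time in sorted(kills, key=lambda kill: kill[2]):
        # Check the victim first, so the third victim of a cheater does not
        # count as having seen three OTHER players killed.
        if threshold_reached and victim not in active_cheaters:
            observers.add(victim)
        if killer in active_cheaters:
            kill_counts[killer] = kill_counts.get(killer, 0) + 1
            if kill_counts[killer] >= 3:
                threshold_reached = True
    return observers


# Toy match (hypothetical): cheater "B" makes three kills, then "A" and "p4" die.
toy_kills = [("B", "p1", 1), ("B", "p2", 2), ("B", "p3", 3),
             ("X", "A", 4), ("B", "p4", 5)]
print(sorted(observers_in_match(toy_kills, {"B"})))
```

In this sketch a player counts as an observer regardless of who kills them, as long as their death comes after a cheater's third kill; whether that matches the intended reading of the rule is a judgment call.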

Then, use the 20 randomizations from Part 2 to estimate the expected count of observers of cheating who start cheating. Output the mean and the 95% confidence interval for the expected count in these randomized worlds.

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

In [12]:
q3_counter = q3.q3_count(match_dic, account_cheaters, cheater_dic)
q3_str = "The number of players who observed an active cheater on at least one occasion and then started cheating is"
print(q3_str, q3_counter)

The number of players who observed an active cheater on at least one occasion and then started cheating is 212


**SHUFFLE**

In [13]:
q3.q3_shuffle_output(kills_ls, match_dic, account_cheaters, cheater_dic, 20)

The mean number of players who start cheating after observing an active cheater on at least one occasion when players are randomized is: 47.6 The mean has a 95% confidence interval of 32.34 to 62.86
