# Quizmaster by Philip Lassen




## Classification Component

To classify the questions by category, I wrote a web crawl. The crawl logic is remarkably simple yet achieves quite good results. (Note that out.csv, question_answer.csv and crowd.tsv must all be in the same directory for the code to run.)

In [1]:
import operator
import sys
import random
import numpy as np
import csv
import time
import pandas as pd
from googlesearch import search

The web crawl works by choosing a handful of keywords for each category. Then, for each question in the file, it retrieves between 15 and 41 URLs from a Google search of the question and checks each URL string for occurrences of any of the keywords. The category whose keywords occur most often in the URLs is chosen as the topic label.

In [7]:
index = 0
count = 0
music_words = {"song", "single", "album", "artist", "singer", "music"}
sport_words = {"sport", "olympics", "medal", "athlete", "competition", "win"}
game_words = {"xbox", "playstation", "game", "nintendo", "wii"}
kids_words = {"kid", "series", "history", "child", "happy", "cartoon"}
technology_words = {"science", "chem", "physic", "math", "computer"}
word_map = {"sports" : sport_words, "music" : music_words, "video-games" : 
            game_words, "for-kids" : kids_words, "science-technology" : technology_words}
count_map = {topic : 0 for topic in word_map.keys()}

# The crawl takes hours to run; set Run to True to execute it
Run = False
if Run:
    test_data =  pd.read_csv("question_answer.csv", header = None, names = ["Question", "Answer"], sep=";")
    wtr = csv.writer(open ('out.csv', 'a'), delimiter=',', lineterminator='\n')
    q_count = index
    time_start = time.time()
    for q in test_data["Question"][index:]:
      q_count += 1
      print("Question Number: " + str(q_count), file = open("crawl_log.txt", "a"))
      print(q, file = open("crawl_log.txt", "a"))
      count_map = {topic : 0 for topic in word_map.keys()}
      url_num = random.randint(15, 41)
      wait_time = round(random.uniform(0, 1), 2) + random.randint(1, 4)
      result = search(q, num = url_num, pause = wait_time, stop = url_num)
      for r in result:
        for (topic, words) in word_map.items():
          count = 0
          for w in words:
            if w in r.lower():
              count += 1
          count_map[topic] += count
      label = max(count_map.items(), key=operator.itemgetter(1))[0]
      print(label , file = open("crawl_log.txt", "a"))
      print(count_map, file = open("crawl_log.txt", "a"))
      wtr.writerow([label])

One issue with the web crawl is that Google does not approve of automated crawling and sporadically denies access to its search engine after too many requests. Being mindful of this, I inserted a randomized pause of a few seconds between searches. However, Google would still deny access at some points. As a workaround, I logged each question number and its label to the file crawl_log.txt, so that when the code throws an error the log shows how far the crawl got. I then wrote a bash script that reads the last question number from the log and restarts the program from where it was denied access. To be mindful of Google, the script waits between reruns (the version below sleeps for 1800 seconds, i.e. 30 minutes) until the web crawl has completed successfully.

```
#!/usr/bin/env bash
LINE="$(tail -n2 crawl_log.txt | head -n1 | rev | cut -d' ' -f1 | rev)"
while [ $LINE -lt 570 ]
do
  python3 crawl.py -v -n $LINE
  echo "going to sleep"
  sleep 1800
  LINE="$(tail -n2 crawl_log.txt | head -n1 | rev | cut -d' ' -f1 | rev)"
done
```
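The script above reads the question number from the second-to-last line of the log, which works when the crawl fails right after a question has been logged. As a hedged alternative, the same lookup could scan the whole log for the last `Question Number: N` line. The function name `last_question_number` and the sample log are assumptions made for illustration; only the log format is taken from the `print` calls in the crawl code.

```python
import re

def last_question_number(log_text):
    """Return the last N found in a 'Question Number: N' line, or 0 if none."""
    numbers = re.findall(r"^Question Number: (\d+)$", log_text, flags=re.MULTILINE)
    return int(numbers[-1]) if numbers else 0

# Hypothetical log excerpt in the format written by the crawl code:
# question number, question, label, keyword counts.
log_text = "\n".join([
    "Question Number: 41",
    "Example question A?",
    "music",
    "{'sports': 1, 'music': 7}",
    "Question Number: 42",
    "Example question B?",
])

resume_from = last_question_number(log_text)
print(resume_from)
```

Because it does not depend on where in the loop the crawl stopped, this lookup also returns a sensible value when the log ends with a label and a count line.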

My accuracy from the competition is shown below

12 | vgh804 | 0.705263157895 | 0.724561403509

Considering the simplicity of the web crawl, the results are quite impressive. The biggest issue is that improving the web crawl was not practical, because Google often denied access to its search results.
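One way to make the keyword sets easier to tune would be to separate the labelling step from the search itself, so that it can be run on saved URLs without contacting Google. The following is a minimal sketch under that assumption; the function name `classify_urls`, the reduced keyword sets and the example URLs are hypothetical and not part of the original crawl.

```python
def classify_urls(urls, word_map):
    """Return (label, counts) for a list of URL strings."""
    counts = {topic: 0 for topic in word_map}
    for url in urls:
        lowered = url.lower()
        for topic, words in word_map.items():
            # Count how many of the topic's keywords occur in this URL.
            counts[topic] += sum(1 for w in words if w in lowered)
    # max() returns the first topic with the highest count, so ties go to
    # whichever topic comes first in word_map.
    label = max(counts, key=counts.get)
    return label, counts

# Reduced keyword sets, for illustration only.
word_map = {
    "sports": {"sport", "olympics", "medal"},
    "music": {"song", "album", "singer"},
}
# Hypothetical URLs standing in for real search results.
urls = [
    "https://example.com/music/best-album-of-the-year",
    "https://example.org/singer-biography",
    "https://example.net/olympics-2008-results",
]

label, counts = classify_urls(urls, word_map)
print(label, counts)
```

The counting logic mirrors the nested loops in the crawl code; only the source of the URLs differs.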

## Majority Voting

I defined a simple function, majority_vote, that takes a list of votes and the number of possible outcomes and returns the most common vote (ties go to the lowest label). The code below builds a new pandas DataFrame in which the majority votes for factuality, opinion and difficulty form new columns, and in which each question is labelled with its category from the web crawl.

In [2]:
data = pd.read_csv("crowd.tsv", encoding = "iso-8859-1", sep = '\t')
result = data.groupby(["question"])
count = 0

def category_to_int(cat):
  if cat == "Easy":
    return 1
  if cat == "Medium":
    return 2
  if cat == "Hard":
    return 3

data["difficulty"] = data["difficulty"].apply(category_to_int)

def majority_vote(votes, number_of_outcomes):
  a = [0] * (number_of_outcomes + 1)
  for v in votes:
    a[v] += 1
  return a.index(max(a))

test_data =  pd.read_csv("question_answer.csv", header = None, names = ["Question", "Answer"], sep=";")
test_labels = pd.read_csv("out.csv", header = None, names = ["Category"])
fdf = pd.DataFrame({"question" : [], "difficulty" : [], "opinion" : [], "factuality" : [], "answer": [], "category" : []})

for r in result:
  q = (r[0])
  idx_frame = (test_data[test_data["Question"] == q])
  if not idx_frame.empty:
    idx_frame = idx_frame.iloc[[0]]
    idx = idx_frame.index.item()
    ans = test_data.iloc[[idx]]["Answer"].item()
    cat = test_labels.iloc[[idx]]["Category"].item()
    dif = majority_vote(r[1]["difficulty"].values.tolist(), 3)
    op = majority_vote(r[1]["opinion"].values.tolist(), 3)
    fac = majority_vote(r[1]["factuality"].values.tolist(), 2)
    temp = {"question" : q, "difficulty" : dif, "opinion" : op, "factuality" : fac, "answer" : ans, "category" : cat}
    fdf =fdf.append(temp, ignore_index = True)

cleaned_data = fdf
cleaned_data

|   | question | difficulty | opinion | factuality | answer | category |
|---|---|---|---|---|---|---|
| 0 | A type glass that is highly resistant to heat? | 3.0 | 1.0 | 1.0 | Borosilicate Glass | for-kids |
| 1 | 7 rings' is a song by which American singer? | 2.0 | 3.0 | 0.0 | Ariana Grande | music |
| 2 | A device used to measure the strength of magne... | 3.0 | 1.0 | 0.0 | Magnetometer | science-technology |
| 3 | A winner in Boxing (Light Flyweight) at the 20... | 3.0 | 1.0 | 0.0 | Zou Shiming | sports |
| 4 | An ordinary bit can have two states. How many ... | 2.0 | 2.0 | 0.0 | Infinite(Bounded by the Bloch sphere) | science-technology |
| 5 | At what age did Jimmy Hendrix, Janis Joplin an... | 1.0 | 3.0 | 0.0 | 27 | music |
| 6 | Besides Celtic and Ranger, which club has most... | 3.0 | 1.0 | 0.0 | Aberdeen | sports |
| 7 | Can you finish the lyric of the song by Backst... | 3.0 | 2.0 | 0.0 | Your expression | music |
| 8 | Earth, Wind and ____? | 2.0 | 3.0 | 0.0 | Fire | sports |
| 9 | Football is played primarily using which body ... | 1.0 | 2.0 | 0.0 | The foot. | sports |


There may be a more idiomatic Pandas approach for implementing the merging of tables and performing joins. But the code above does get the desired results.
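As one possible version of such an approach, the grouping and joining could be sketched with `groupby`, `agg` and `merge`. The toy frames below are hypothetical stand-ins for crowd.tsv and for question_answer.csv combined with out.csv, and the helper name `majority` is an assumption. This is a sketch of the idea, not a drop-in replacement for the cell above.

```python
import pandas as pd

# Hypothetical stand-in for crowd.tsv: one row per crowd vote.
crowd = pd.DataFrame({
    "question": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "difficulty": [1, 1, 3, 2, 3, 3],
    "opinion": [2, 2, 1, 3, 3, 1],
    "factuality": [0, 1, 1, 0, 0, 1],
})
# Hypothetical stand-in for question_answer.csv joined with out.csv.
answers = pd.DataFrame({
    "question": ["Q1", "Q2"],
    "answer": ["A1", "A2"],
    "category": ["music", "sports"],
})

def majority(votes):
    # mode() returns every most-frequent value; min() breaks ties toward
    # the smallest label, as a.index(max(a)) does in majority_vote.
    return votes.mode().min()

# One row per question with the majority vote for each attribute.
voted = crowd.groupby("question", as_index=False).agg(
    difficulty=("difficulty", majority),
    opinion=("opinion", majority),
    factuality=("factuality", majority),
)
# The inner join keeps only questions present in both tables.
cleaned = voted.merge(answers, on="question", how="inner")
print(cleaned)
```

If the answer table contained duplicate questions, the merge would duplicate rows, whereas the loop above keeps only the first match, so the answer table would need `drop_duplicates("question")` first.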

## Convergence Component

The Convergence Component starts by assigning all categories an equal score. As the user's scores improve or worsen, the relative probability of being asked a question from each category changes: each category is assigned a normalized probability based on the user's accuracy in it, so the higher the accuracy, the more likely a question from that category is asked. Once one category clearly dominates (the code checks whether its score is at least double that of every other category) or after a specified number of questions (25 in the code below), the category with the highest accuracy is selected.

In [6]:
data = cleaned_data
initial_scores = [1, 1]
scores= {"sports" : [1, 1], "music" : [1, 1], "video-games" : [1, 1], "for-kids" : [1, 1], "science-technology" : [1, 1]}


def choose_cat(scores):
  accuracy = {c : sum(scores[c])/ len(scores[c]) for c in scores.keys()}
  total = sum(list(accuracy.values()))
  tot = 0
  probs = {c : accuracy[c] / total for c in scores.keys()}
  thresh = {}
  for key, value in probs.items():
    tot += value
    thresh[key] = tot
  r = np.random.uniform()
  for key, value in thresh.items():
    if r < value:
      return key

def qa(question, answer):
  print(question)
  guess = input("What is the answer? : ")
  return int(guess == answer)

def qa_row(series):
  return qa(series["question"], series["answer"])

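n = 0 # number of total questions asked
def is_double(scores):
  # True when one category's accuracy is at least double every other's.
  # Accuracies are compared rather than the raw score lists, since `<` on
  # lists is lexicographic and `2 * list` repeats the list.
  acc = {c : sum(scores[c]) / len(scores[c]) for c in scores.keys()}
  for k in acc.keys():
    is_double = True
    for j in acc.keys():
      if acc[k] < 2 * acc[j] and k != j:
        is_double = False
    if is_double:
      return True
  return False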

# Change to True to answer questions interactively
# This will halt later Components from running
run = False
if run:
    while (not is_double(scores) and n < 25):
      n += 1
      c = choose_cat(scores)
      df = data.loc[data["category"] == c]
      row = random.randint(0, df.shape[0] - 1)
      scores[c] += [qa_row(df.iloc[row])]

def get_max_key(scores):
  # Pick the category with the highest mean accuracy
  return max(scores.keys(), key=lambda c : sum(scores[c]) / len(scores[c]))
if run:

    cat = get_max_key(scores)
    df_cat = data.loc[data["category"] == cat]
    while (True):
      row = random.randint(0, df_cat.shape[0] - 1)
      qa_row(df_cat.iloc[row])
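
The accuracy-proportional sampling described for this component can be illustrated with a small worked example. This is a minimal sketch using the standard library's `random.choices` instead of the cumulative thresholds in `choose_cat`; the function names `category_probabilities` and `choose_category` and the example scores are assumptions for illustration.

```python
import random

def category_probabilities(scores):
    """Normalise each category's accuracy into a sampling probability."""
    accuracy = {c: sum(s) / len(s) for c, s in scores.items()}
    total = sum(accuracy.values())
    return {c: a / total for c, a in accuracy.items()}

def choose_category(scores, rng=random):
    """Draw one category with probability proportional to its accuracy."""
    probs = category_probabilities(scores)
    categories = list(probs)
    weights = [probs[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]

# Hypothetical scores: 1 is a correct answer, 0 an incorrect one.
scores = {
    "sports": [1, 1, 0, 0],       # accuracy 0.50
    "music": [1, 1, 1, 1],        # accuracy 1.00
    "video-games": [1, 1, 1, 0],  # accuracy 0.75
}
probs = category_probabilities(scores)
print(probs)
print(choose_category(scores))
```

With these scores the accuracies sum to 2.25, so music is drawn with probability 1.00 / 2.25 = 4/9. The normalisation assumes at least one category has a non-zero accuracy, which the initial `[1, 1]` seed in the original `scores` guarantees.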

## Friend Recommender

To implement the Friend Recommender I used a Pearson-style correlation. I first normalized the data by subtracting each user's mean opinion from all of their opinions, then filled the NaN values with 0 and applied cosine similarity to the resulting user-by-question table. Finally, each user's similarities are sorted and the indices of the k most similar other users are returned (these are row positions in the table, not user ids).

In [4]:
df = pd.read_csv("crowd.tsv", encoding = "iso-8859-1", sep = '\t')
df = df[['id', 'question', 'opinion']]
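# Select the numeric column before averaging so that the text column 'question' is not aggregated
mean = df.groupby('id', as_index = False, sort = False)['opinion'].mean().rename(columns = {'opinion' : 'opm'})[['id', 'opm']]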
Op = pd.merge(df, mean, on = 'id', how = 'left', sort = False)
Op['adjusted'] = Op['opinion'] - Op['opm']
result = pd.DataFrame({'id':Op['id'], "question": Op['question'], "opinion" : Op['adjusted']})
table = result.pivot_table(index = 'id', columns = 'question', values = 'opinion').fillna(0)
table
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(table)
resulty = np.argsort(cosine_similarity(table))
user = 10
k = 4
# The last position is normally the user themselves, so take the k entries before it
np.flip(resulty[user, -(k+1):-1])

array([23, 13, 12,  8])

I struggled with implementing the Friend Recommender and used the following guide for help with the pandas features needed for the recommendation system: https://medium.com/@sam.mail2me/recommendation-systems-collaborative-filtering-just-with-numpy-and-pandas-a-z-fa9868a95da2
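The steps described in this section (mean-centring, zero-filling, cosine similarity, top-k selection) can also be sketched with NumPy alone on a small matrix. The ratings below are hypothetical and the function name `top_k_friends` is an assumption; the sketch excludes the user explicitly instead of relying on their position in the sorted order.

```python
import numpy as np

# Hypothetical opinion matrix: rows are users, columns are questions,
# np.nan marks questions a user did not rate.
ratings = np.array([
    [3.0, 1.0, np.nan, 2.0],
    [3.0, 1.0, 2.0, np.nan],
    [1.0, 3.0, 2.0, np.nan],
    [np.nan, 3.0, 1.0, 2.0],
])

def top_k_friends(ratings, user, k):
    """Return the row positions of the k users most similar to `user`."""
    # Subtract each user's mean opinion, ignoring missing entries.
    centred = ratings - np.nanmean(ratings, axis=1, keepdims=True)
    # Unrated questions contribute nothing to the similarity.
    centred = np.nan_to_num(centred, nan=0.0)
    # Cosine similarity between every pair of users.
    norms = np.linalg.norm(centred, axis=1, keepdims=True)
    unit = centred / np.where(norms == 0, 1.0, norms)
    similarity = unit @ unit.T
    # Exclude the user themselves, then sort from most to least similar.
    row = similarity[user].copy()
    row[user] = -np.inf
    return np.argsort(row)[::-1][:k].tolist()

friends = top_k_friends(ratings, user=0, k=2)
print(friends)
```

In this toy matrix user 1 has the same centred opinions as user 0 and user 2 has the opposite ones, so user 1 ranks first and user 2 last.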

## User Difficulty Component

The Difficulty Component is quite simple. It reuses the DataFrame from the Majority Voting section, where majority vote was also used to classify the difficulty of the questions. We choose a question at random from the current difficulty level until the user answers correctly, and then repeat the process for the next difficulty level.

In [5]:
print("Welcome to the ultimate Quiz")
print("")
print("We will ask you Questions of increasing difficulty")
print("You must pass all three levels to WIN THE GAME")

difficulty_map = { 1 : "Easy", 2 : "Medium", 3 : "Hard"}

level = 1
while (level < 4):
  print("You are on Level : " + difficulty_map[level])
  df = data.loc[data["difficulty"] == level]
  row = random.randint(0, df.shape[0] - 1)
  level += qa_row(df.iloc[row])
    
print("YOU HAVE COMPLETE THE GAME")

Welcome to the ultimate Quiz

We will ask you Questions of increasing difficulty
You must pass all three levels to WIN THE GAME
You are on Level : Easy
What instrument does a pianist play?
What is the answer? : piano
You are on Level : Easy
What was the Olympic city of 1992 ?
What is the answer? : Barcelona
You are on Level : Medium
When was The Beatles formed?
What is the answer? : ewgk
You are on Level : Medium
Which is the second most polular sport in the world (2018)?
What is the answer? : Cricket
You are on Level : Hard
Which artist made the song "You belong with me"
What is the answer? : Taylor Swift
YOU HAVE COMPLETE THE GAME


## Simulator

The Simulator takes a command line argument when the game is being played. If the '-s' flag is passed then the Simulator mode is turned on. This is done by slightly altering one of the helper functions, in which case the code uses a player profile to choose how the questions are answered.
```
SIM = False
DEBUG = False
if '-s' in sys.argv:
  SIM = True
 
.....

def qa_row(series):
  if SIM:
    categ = series["category"]
    pr = player_profile[categ]
    return int(np.random.uniform() < pr)
  return qa(series["question"], series["answer"])
```


A player profile is a map from each category to the probability that the player knows the answer. An example is shown below:

```
player_profile = {"sports" : .6, "music" : .9, "video-games" : .8, "for-kids" : .7 , "science-technology" : .9}
```

The Simulator Component depends on command line arguments, so it will be shown in the demonstration and can't be run in the notebook.
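To give a notebook-friendly impression of what the simulated mode does, the sketch below plays the three-level difficulty game with a player profile and no user input. It is a simplified illustration under stated assumptions, not the command line implementation: the function name `simulate_levels` is hypothetical, and each question's category is drawn uniformly at random rather than taken from the real question data.

```python
import random

def simulate_levels(player_profile, rng, levels=3, max_questions=1000):
    """Return how many questions a simulated player needs to clear all levels."""
    categories = sorted(player_profile)
    level, asked = 1, 0
    while level <= levels and asked < max_questions:
        asked += 1
        # Assumption: the question's category is drawn uniformly at random.
        category = rng.choice(categories)
        # The simulated answer is correct with the profile's probability,
        # and a correct answer advances the player by one level.
        level += int(rng.random() < player_profile[category])
    return asked

player_profile = {"sports": .6, "music": .9, "video-games": .8,
                  "for-kids": .7, "science-technology": .9}
asked = simulate_levels(player_profile, random.Random(0))
print(asked)
```

A player who always answers correctly finishes in exactly three questions, while the `max_questions` cap keeps the loop finite for a player who never does.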

## Notes

I implemented the Quizmaster in plain Python files. I used features such as command line arguments to change modes and take user input. Not all of these features translated to the Jupyter Notebook.
