# Project Three:  Data Mining for Personal Information from Anonymous Users

This is the secondary Jupyter notebook.  Using the best model from the first notebook's gridsearches, the remainder of the process (trying to determine a user's location based on their word choices) was conducted.

---

### Imports, function definitions, and data structures

In [200]:
#core imports
import numpy as np
import pandas as pd
import requests
import time

#set up and processing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline

#estimators and transformers
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import SVC

In [233]:
def get_pd(subreddit):
    #This is a modified function from the previous notebook to get additional scrapes.  In this function, only comments are used, and it pulls deeper than the others.
    url = f"https://api.pushshift.io/reddit/search/comment"
    
    params = {
        "subreddit":subreddit,
        "size":"500",
        "fields":["author", "body","subreddit"]
            }

    a = 0
    b = 1
    posts = []

    while b < 100:
        params["after"] = f"{b}d"
        params["before"] = f"{a}d"
        res = requests.get(url, params)
        data = res.json()
        posts+= data["data"]
        a+=1
        b+=1
        time.sleep(1)
    return pd.DataFrame(posts)

In [104]:
#Initial data pulls and assessment of results.
nyc = get_pd("nyc")
bos = get_pd("boston")
chi = get_pd("chicago")

print(nyc.shape, bos.shape, chi.shape)

---

Like in the previous notebook, there were issues with using the .replace method with pointers (in functions or for loops) so each dataframe was manually cleaned and saved to a csv.
### NYC cleanup

In [106]:
nyc = nyc.replace("", np.nan)
nyc = nyc.replace("[removed]",np.nan)
nyc = nyc.replace("[deleted]",np.nan)
nyc.dropna(inplace=True)

nyc.to_csv("./data_loc/nyc.csv", index=False)

### Boston cleanup

In [107]:
bos = bos.replace("", np.nan)
bos = bos.replace("[removed]",np.nan)
bos = bos.replace("[deleted]",np.nan)
bos.dropna(inplace=True)

bos.to_csv("./data_loc/bos.csv", index=False)

### Chicago cleanup

In [108]:
chi = chi.replace("", np.nan)
chi = chi.replace("[removed]",np.nan)
chi = chi.replace("[deleted]",np.nan)
chi.dropna(inplace=True)

chi.to_csv("./data_loc/chi.csv", index=False)

In [113]:
#Each of the three csvs are now used for drawing the dataframes from, rather than from the initial scrape. This prevents needing to scrape again if something goes awry.
nyc = pd.read_csv("./data_loc/nyc.csv")
bos = pd.read_csv("./data_loc/bos.csv")
chi = pd.read_csv("./data_loc/chi.csv")

print(nyc.shape, bos.shape, chi.shape)

(4569, 3) (4594, 3) (4570, 3)


---

## Creating the Model

This uses all of the default KNeighbors and Count Vectorizer parameters, which in the gridsearches on the core notebook produced the highest predictive ability.  However, this is now a classification issue between three subreddits, so I retested it.  The model was still yielding around 17% above the baseline.  For targeted madss advertising, an especially high degree of accuracy is not always necessary (although it is always desirable) so I decided to continue with this model despite the relatively low degree of accuracy.

In [110]:
df = pd.concat([nyc, bos, chi])

In [181]:
X = df["body"]
y = df["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [204]:
pipe_nb = Pipeline([
    ("cvec", CountVectorizer(max_df=0.9, max_features=20_000, min_df=1, stop_words="english")),
    ("nb", MultinomialNB())
     ])

pipe_nb.fit(X_train, y_train)

print(f"Training score:\t\t{pipe_nb.score(X_train, y_train)}")
print(f"Test score:\t\t{pipe_nb.score(X_test, y_test)}")

Training score:		0.7941926087748042
Test score:		0.5154714233709501


---

## Creating the corpus of known-location users

In this section, we will
1.  Get a set of users for each city;
2.  Pull in a corpus of documents from other miscellaneous subreddits; and 
3.  See what users match between the new corpus.  

This means we will have a huge dataset of posts from multiple subreddits, among which SOME but not all of the authors have been identified by location.  We will then run our model on this new corpus and see if we can predict the location of these unknown users, using the matching author IDs from the city subreddits to check our accuracy.

In [114]:
auth_chi = set(chi["author"].tolist())
auth_nyc = set(nyc["author"].tolist())
auth_bos = set(bos["author"].tolist())

In [120]:
print(len(auth_chi), len(auth_bos), len(auth_bos))

2144 2210 2210


In [234]:
ask = get_pd("funny")
gmg = get_pd("gaming")
sci = get_pd("science")

In [237]:
fun = ask
print(fun.shape, gmg.shape, sci.shape)

(9800, 3) (9800, 3) (9800, 3)


#### Clean-up & concatenation

Here, the three data frames are cleaned and then compiled into a single dataframe.  The subreddit column is unnecessary & is therefore dropped from the dataframe.  After all, we only need the authors and the words they use - their comments - from this dataframe.

In [238]:
fun = fun.replace("[removed]",np.nan)
fun = fun.replace("[deleted]",np.nan)
fun.dropna(inplace=True)

gmg = gmg.replace("[removed]",np.nan)
gmg = gmg.replace("[deleted]",np.nan)
gmg.dropna(inplace=True)

sci = sci.replace("[removed]",np.nan)
sci = sci.replace("[deleted]",np.nan)
sci.dropna(inplace=True)

corpus = pd.concat([fun, gmg, sci])
corpus.drop(columns="subreddit", inplace=True)

corpus

Unnamed: 0,author,body
0,TheDerpiestCat,u/repostsleuthbot
1,kmsae,Is this the episode of Curb where J.B. Smoove ...
2,Michael_Silveous,What?
3,uptokesforall,What was your answer
4,texxmix,I’m a grow man that’s always thought I was “be...
...,...,...
9792,KodakKid3,"Haha I totally get that, personally I’m a coll..."
9795,pdgenoa,"I appreciate that link, thanks. \n\nAnd I've h..."
9796,shroudoftheimmortal,Let me know when you find a habitable planet f...
9797,Litis3,I don't have the section before me but it was ...


### Getting user cross-sections

Here, we get the set intersections of the users from the new dataframe "corpus" and each of the three locational dataframes.  Despite the huge userbase in each category, there were surprisingly few overlaps in userbases (around 35 unique users from each city were posting in the other subreddits).  This led to some worries about model prediction quality, which I discuss further in the README and address the best I could with the model-running below.

In [239]:
auth_corpus = set(corpus["author"].tolist())

print(len(auth_corpus.intersection(auth_nyc)))
print(len(auth_corpus.intersection(auth_chi)))
print(len(auth_corpus.intersection(auth_bos)))

35
34
35


### Compiling a dataframe of known-location users

From "corpus", the new dataframe, we have ~100 users who *also* post in locational subreddits (specifically, r/boston, r/chicago, or r/nyc).  Therefore in order to test whether the model is *able to distinguish a user's location based on the content & words of their posts,* we compile a dataframe of these intersection users.  Then we can run the model we already constructed on this smaller dataframe and test to see whether or not it's actually working, because we already know what the outcome *should* be with these users.

I am aware that there is almost certainly a better way to have gotten such a dataframe from "corpus" with Python, but at this point, time constraints led me to settle on the first solution that worked rather than the most elegant solution.

In [240]:
corpus["loc"] = corpus["author"]
corpus["loc"] = corpus["loc"].map(lambda x: "nyc" if x in auth_nyc else np.nan)
nyc_corp = corpus.dropna()

corpus["loc"] = corpus["author"]
corpus["loc"] = corpus["loc"].map(lambda x: "boston" if x in auth_bos else np.nan)
bos_corp = corpus.dropna()

corpus["loc"] = corpus["author"]
corpus["loc"] = corpus["loc"].map(lambda x: "chicago" if x in auth_chi else np.nan)
chi_corp = corpus.dropna()

In [243]:
known = pd.concat([nyc_corp, bos_corp, chi_corp])
known = known.replace("AutoModerator", np.nan)
known.dropna(inplace=True)

### Testing on the known data

I ran the test initially once and got a score of around ~32%.  However, the thought occured to me that it was possible that the model just got unlucky with this run, so I tried running the model an arbitrarily large number of time with replacement, and got the mean score.  The big assumption here is that the posts I pulled from "corpus" are representative of the data I wanted to pick, which is unfortunately questionable given the relatively small amount of intersection users I had to work with.  There are a number of limits & concerns with the final outcome wh

In [248]:
meanlst = []


for i in range(1000):
    X = known["body"].sample(len(known["body"]), replace=True)
    y = known["loc"].sample(len(known["loc"]), replace=True)
    meanlst.append(pipe_nb.score(X, y))
    
print(f"Final score:\t\t{np.mean(meanlst)}")

Final score:		0.34004316546762586
