# Summary

## Objective

For this human-guided machine learning problem, the challenge was to settle on useful user-created variables that could help to predict whether a URL was correctly linked with an institution of higher learning, or whether it was an error. 

## Approach

I created three types of variables based on institution names to predict potential URLs: potential acronyms, keywords, and potential abbreviations. These variables were tested extensively in the scratch notebooks two_scratch and two_scratch_2, with key results documented here. In each case, I generate a list of substrings, derived from the institution's name, that might plausibly appear in its URL, and then create variables that show whether any of them appear in the actual URL. These latter variables are used to predict whether a URL is correct. If any of the three matches the URL, the predictor matched_url is set to True.

I began working on this problem with the intention of using a logistic regression model to generate probabilities, but I quickly realized two things: my generated variables were already good at detecting whether URLs matched institution names, and they were highly correlated with one another. Strongly collinear predictors make the coefficients of a model such as logistic regression unstable and hard to interpret, and that collinearity was never going to go away in my problem.

### Details
Acronyms, keywords and abbreviations are each stored as lists of variants that might appear in URLs. In particular, I found from perusing the data that many institution names consisted of a central university followed by a local branch, while the URL would only contain the central university's name. Therefore, I generated potential acronyms in which the final letter or final two letters were omitted (ostensibly corresponding to place names). The downside, especially with two-letter acronyms, is the potential for false positives, but the improvement in matching seemed to justify the risk. Using simple acronyms correctly predicted 67/100 training results versus 68/100 with an enhanced set of acronyms, so perhaps the additional complexity is not warranted.

## Results

Using all 3 types of variables, I am able to find a link to 90.8% of all institutions in the database of accredited centers of higher learning (where almost all URLs are correctly associated), and correctly predict whether 96/100 potential URLs in the training set really belong to the listed institution or not. A further visual inspection of the accredited institution database suggests that many of the remaining URLs were not really guessable from institution names.

## How to use this notebook to test an arbitrary set of institution names and URLs

The easiest way is to load your data as a dataframe (e.g. `test_df = pd.read_csv('testFile.csv')`) and then add the new dataframe to the list `dfs` in the cell that loads the data (`dfs = [accred_df, sample_df, test_df]`). The file needs `Institution` and `URL` columns. All user-defined variables will automatically be added to this dataframe, along with the final prediction in the `Prediction` column. I'm happy to write some other interface if that seems too complicated.
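
As one possible alternative interface, the whole pipeline can be bundled into a single function that takes any dataframe with `Institution` and `URL` columns. The sketch below is a condensed approximation of the cells in this notebook, not the code that produced the reported results; the names `candidate_substrings` and `flag_suspect_urls` are hypothetical.

```python
import re
import pandas as pd

# Hypothetical helper names; the notebook itself builds these columns cell by cell.
GENERIC_WORDS = {'college', 'institute', 'school', 'schools', 'university', 'the', 'of', 'and'}

def strip_url(url):
    """Lowercase, drop scheme and 'www.', drop the final suffix, remove hyphens."""
    url = url.lower().replace('http://', '').replace('https://', '').replace('www.', '')
    match = re.match(r"(.*)\.", url)
    return match.group(1).replace('-', '') if match else url

def candidate_substrings(name):
    """Keywords, 3-letter abbreviations and acronym variants for one institution name."""
    words = [w.lower().replace("'", '') for w in re.split(r'\s|-', name)]
    keywords = [w for w in words if w and w not in GENERIC_WORDS]
    abbreviations = [w[:3] for w in keywords]
    ac = ''.join(c.lower() for c in name.replace('The ', '') if c.isupper())
    ac_all = ''.join(w[0].lower() for w in name.split()).replace('-', '')
    acronyms = [ac, ac_all]
    if len(ac) >= 3:
        acronyms += [ac[1:], ac[:-1]]
    if len(ac) >= 4:
        acronyms.append(ac[:-2])
    # Empty strings are dropped because '' is a substring of every URL.
    return [c for c in keywords + abbreviations + acronyms if c]

def flag_suspect_urls(df):
    """Return a copy of df with matched_url and Prediction (True = URL looks wrong)."""
    out = df.copy()
    stripped = out['URL'].apply(strip_url)
    out['matched_url'] = [
        any(c in url for c in candidate_substrings(name))
        for name, url in zip(out['Institution'], stripped)
    ]
    out['Prediction'] = ~out['matched_url']
    return out

# Two rows taken from the accredited database shown later in this notebook.
demo_df = pd.DataFrame({
    'Institution': ['American Commercial College of Texas', 'Marinello School of Beauty'],
    'URL': ['acc-careers.com', 'brioacademy.com'],
})
result = flag_suspect_urls(demo_df)
print(result[['Institution', 'matched_url', 'Prediction']])
```

With this shape, testing a new file would reduce to `flag_suspect_urls(pd.read_csv('testFile.csv'))`, assuming the file uses the same column names.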

I've also saved the results from this analysis as pickle files stored in this GitHub repository (two_accred.pkl, two_sample.pkl and two_test.pkl).

## Remaining Questions

I am curious about the risk of false positives that comes with generating larger lists of potential acronyms and abbreviations. In test notebook two_scratch_2, I found that adding possible abbreviations increased the percentage of predicted URLs in the accredited institution database from 89.0% to 91.3%, but that database also does not contain any incorrect matches.

In [1]:
import numpy as np
import pandas as pd
import pickle
import re
pd.set_option('display.max_rows', 200)  # full option name; the bare 'max_rows' pattern can be ambiguous in newer pandas

First, load data into pandas.

In [2]:
accred_df = pd.read_csv('Accreditation_Data.csv')
sample_df = pd.read_csv('sampleTestFile.csv')
test_df = pd.read_csv('testFile.csv')

dfs = [accred_df, sample_df, test_df]

print("Length of Accredited database:", len(accred_df))
print("Length of Training database:", len(sample_df))
print("Length of Test database:", len(test_df))

Length of Accredited database: 4383
Length of Training database: 100
Length of Test database: 392


Let's start by treating the URLs - convert everything to lowercase and get rid of prefixes and suffixes which might cause a false match. I'm also going to take out all hyphens because they can prevent us from spotting a potential acronym match.

In [3]:
## convert all URLs to lowercase first.
for df in dfs:
    df['URL'] = df['URL'].str.lower()

## define a function to reduce URLs to most useful information.
def strip_url(url):
    
    ##remove prefixes and suffixes from URL
    stripped_url = url.replace('http://','').replace('https://','').replace('www.','')
    match = re.match(r"(.*)\.", stripped_url)  #greedy regular expression: keeps everything before the last dot, stripping endings such as .edu
    
    if match: 
        url_out = match.groups()[0].replace('-','')
        print(url_out)
        return url_out
    
    #some URLs in the databases are malformed (no dot at all) - the fallback below prevents the code from breaking.
    else:
        print('nomatch:',stripped_url)
        return stripped_url
    
for df in dfs:
    df['stripped_url'] = df['URL'].apply(strip_url)

acccareers
acccareers
accschool.peedeeworld
acmt
aesa
aihs
akronbeautyschool
akronschools
alexacebeauty
alliedteched
americanbeautyacademy
americancollegeofhair
antiochcollege
antonellic
aoma
artisticbeautycolleges
artisticbeautycolleges
ashdowncollege
ati.ag.ohiostate
aticareertraining
audioschool
avtec.labor.state.ak
awc
bc.inter
beacon
bealcollege
beautycareers
beautyschool
bellevue
bellin
bellinghambeautyschool
beonair
berkeley.peralta
beta.nwiht
blottsalonschools
blue.ab
boe.kana.k12.wv
boe.mono.k12.wv
boe.putn.k12.wv
bramsonort
brc
brewstertech
brioacademy
brownsontechnicalschool
brymancollege
brymancollege
burlingtontech
californiacareerschool
canadacollege
capitolcollege
carnegieinstitute
carouselbeauty
carouselbeauty
carouselbeauty
cccua
cci
cci
cci
ccsce
cdtschool
ceitraining
cempr
centralcareer
centralohio.dalecarnegie
centuryschool.bizhosting
cetcleveland
cfinstitute
charlesstuartschool
chase
nomatch: clank85462
classact1cosmetology
coastline.cccd
collegeofhairdesign
colmiz
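
To make the normalisation concrete, here is a print-free variant of `strip_url` applied to a handful of inputs. The name `strip_url_quiet` is hypothetical; the first four example URLs are based on entries in the output above, and `example.edu/index.html` is a made-up input that illustrates a limitation.

```python
import re

def strip_url_quiet(url):
    """Same steps as strip_url above, without the debugging prints."""
    stripped = url.lower().replace('http://', '').replace('https://', '').replace('www.', '')
    match = re.match(r"(.*)\.", stripped)  # greedy: keeps everything before the LAST dot
    return match.group(1).replace('-', '') if match else stripped

examples = [
    'http://www.acc-careers.com',   # scheme, www and hyphen are all removed
    'acmt.ac/',                     # trailing slash disappears with the suffix
    'boe.kana.k12.wv.us/',          # only the final suffix is removed
    'clank85462',                   # malformed entry with no dot: returned unchanged
    'example.edu/index.html',       # made-up input: a dot in the path confuses the regex
]
results = {url: strip_url_quiet(url) for url in examples}
for url, stripped in results.items():
    print(f"{url!r:32} -> {stripped!r}")
```

Because the regular expression is greedy, a URL with a dotted path such as `example.edu/index.html` keeps its domain suffix and part of the path (`example.edu/index`). Whether that matters depends on how many URLs in the real data carry paths, which I have not checked here.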

Let's write a function to produce our first variable given an institution name - Keywords. More specifically, these are the words in the Institution name other than a short stop list ('the', 'of', 'and') and generic terms such as College, School, University etc.

In [4]:
## this function does several things at once:
## 1) we now also split on hyphens. Webpages may be hyphenated, but this way we'll still pick them up instead of missing
## them if they drop the hyphen in URL.
## 2) apostrophes aren't allowed in URLs, so we delete all apostrophes.
## 3) any generic words like College, School etc. are omitted, because they are not specific enough to lead to a match.

def get_keywords(x):
    possible_words = re.split(r'\s|-', x) ## split on either space or hyphen w/regular expression
    omit_words = ['college','institute','school','schools','university','the','of',',','-','and']
    possible_words = [ word.lower() for word in possible_words ]  ## convert everything to lower case
    
    #following line does 2 things at once: 1) drops words contained above in omit_words; 2)removes apostrophes
    resultwords  = [word.replace('\'', '') for word in possible_words if word.lower() not in omit_words]
    resultwords = list(filter(None, resultwords)) # fastest way to remove any new empty strings
    return resultwords
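
As a quick check of what the keyword step keeps and drops, the sketch below re-implements `get_keywords` compactly and runs it on three institution names that appear in the tables further down. It is meant as an illustration of the splitting and filtering rules, not as a replacement for the cell above.

```python
import re

def get_keywords(name):
    """Compact variant of the function above: split, lowercase, filter, strip apostrophes."""
    omit_words = ['college', 'institute', 'school', 'schools', 'university',
                  'the', 'of', ',', '-', 'and']
    words = [w.lower() for w in re.split(r'\s|-', name)]
    kept = [w.replace("'", '') for w in words if w not in omit_words]
    return [w for w in kept if w]  # drop empty strings left by ' - ' separators

names = [
    'American Commercial College of Texas',
    'Everest College - Los Angeles',
    'Ben Franklin Career & Technical Education Center',
]
keywords = {name: get_keywords(name) for name in names}
for name, words in keywords.items():
    print(name, '->', words)
```

The last example shows that tokens such as `&` (and likewise the preposition `for`) are not on the omit list and therefore survive as keywords. Adding them to the list might be a reasonable refinement, though I have not measured its effect on the match rates.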

Here's our second variable: a list of potential acronyms based on Institution name. I originally started out with just the acronym created by taking all capitalized letters, which I saved in each dataframe as the variable 'acronym'. However, I found that many schools consist of a core university name and then a location specifier (e.g. California State University - East Bay). In the end, I wound up creating up to 5 possible acronyms per institution: 1) every capital letter; 2) drop the first letter; 3) drop the last letter; 4) drop the last two letters (many cities are two words); and 5) the first letter of every word, even conjunctions. The improvements from the last two were pretty marginal and number 4 may lead to more false positives, but I found repeated instances where including extra acronyms helped to find a correspondence with a URL.

Let's define a couple of basic acronym functions first:

In [5]:
#takes capitals, makes acronym
def acronym(s):
    s = s.replace('The ','') #specifically handles initial The (which tends to mess up acronyms)
    return "".join(c.lower() for c in s if c.isupper())

#takes first letter of each word - useful if institution name is wrong (for instance all-caps)
def acronym_allwords(s):
    ac = "".join(word[0].lower() for word in s.split())
    return ac.replace('-','') #get rid of any hyphens

In [6]:
## uses row from dataframe to produce more possible acronyms.
def get_acronyms(x):
    
    ac = acronym(x)
    ac_allwords = acronym_allwords(x)
    acronyms = [ac,ac_allwords]
    
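    if len(ac) >= 3:  ## length limit: truncating a shorter acronym would leave a single letter, which risks false matches
        
        acronyms.append(ac[1:])
        acronyms.append(ac[:-1])
        
        if len(ac) >= 4:
            acronyms.append(ac[:-2])
        
    acronyms = [a for a in acronyms if a]  ## drop empty strings (e.g. a name with no capitals), since '' is a substring of every URL
    print(acronyms)
    return acronyms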
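
To see how the five acronym rules combine, here is a compact, print-free variant of the functions above applied to three names from the data. The function name `acronym_variants` is hypothetical; the logic follows the cells above.

```python
def acronym(s):
    """Lowercased capital letters, ignoring an initial 'The '."""
    s = s.replace('The ', '')
    return "".join(c.lower() for c in s if c.isupper())

def acronym_allwords(s):
    """First letter of every word, with hyphens removed."""
    return "".join(word[0].lower() for word in s.split()).replace('-', '')

def acronym_variants(name):
    ac = acronym(name)
    variants = [ac, acronym_allwords(name)]
    if len(ac) >= 3:
        variants += [ac[1:], ac[:-1]]   # drop first letter, drop last letter
    if len(ac) >= 4:
        variants.append(ac[:-2])        # drop last two letters
    return variants

examples = {
    name: acronym_variants(name)
    for name in ['American Commercial College of Texas',
                 'The General Theological Seminary',
                 'Sunstate Academy']
}
for name, variants in examples.items():
    print(name, '->', variants)
```

Note how short the resulting candidates can get: the last two names produce the two-letter strings `gt` and `sa`, which are the ones behind the false matches discussed later in the notebook.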

The third and final variable tests for the use of abbreviations in URLs. For instance the University of Wisconsin - Madison uses wisc.edu as domain name. Therefore, I simply take the list of keywords and turn each into a potential 3-letter abbreviation.

In [7]:
def get_abbreviations(keywords):
    ## use the first three letters of each keyword as a potential abbreviation used in url
    return [ s[:3] for s in keywords ]
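
A small illustration of the abbreviation idea, using the Wisconsin example from the text and the Newbridge row from the training set. The stripped URLs `wisc` and `newpaltz` are typed in by hand here rather than computed.

```python
def get_abbreviations(keywords):
    """First three letters of each keyword."""
    return [s[:3] for s in keywords]

# 'wis' is contained in 'wisc', so the abbreviation catches a URL the full keyword misses.
wisc_abbrevs = get_abbreviations(['wisconsin', 'madison'])
wisc_keyword_hit = any(k in 'wisc' for k in ['wisconsin', 'madison'])
wisc_abbrev_hit = any(a in 'wisc' for a in wisc_abbrevs)

# The same mechanism can misfire: 'new' (from Newbridge) also appears in 'newpaltz'.
newbridge_abbrevs = get_abbreviations(['newbridge', 'el', 'cajon'])
newbridge_abbrev_hit = any(a in 'newpaltz' for a in newbridge_abbrevs)

print(wisc_abbrevs, wisc_keyword_hit, wisc_abbrev_hit)
print(newbridge_abbrevs, newbridge_abbrev_hit)
```

Since each abbreviation is a prefix of its keyword, any URL that contains a keyword necessarily contains that keyword's abbreviation as well. Abbreviation matches are therefore a superset of keyword matches, which is worth keeping in mind when comparing the two variables.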

Let's actually create these columns in each of our dataframes.

In [8]:
## I originally tried to add all these columns in the same for loop, but then realized that this
## was actually a substantial error since I would be trying to edit the iterator...

for df in dfs:
    df['keywords'] = df['Institution'].apply(get_keywords) #keywords

for df in dfs:
    df['potential_acronyms'] = df['Institution'].apply(get_acronyms) #acronyms

for df in dfs:
    df['potential_abbreviations'] = df['keywords'].apply(get_abbreviations) #abbreviations

['acct', 'accot', 'cct', 'acc', 'ac']
['acca', 'acca', 'cca', 'acc', 'ac']
['acc', 'acoc', 'cc', 'ac']
['acmt', 'acomt', 'cmt', 'acm', 'ac']
['aesac', 'aaesoac', 'esac', 'aesa', 'aes']
['auhs', 'auohs', 'uhs', 'auh', 'au']
['gabs', 'gabs', 'abs', 'gab', 'ga']
['avs', 'avs', 'vs', 'av']
['aabc', 'aaobc', 'abc', 'aab', 'aa']
['fis', 'fis', 'is', 'fi']
['abaw', 'abaw', 'baw', 'aba', 'ab']
['achcr', 'acohcr', 'chcr', 'achc', 'ach']
['ac', 'ac']
['ach', 'ach', 'ch', 'ac']
['aoma', 'aoomaa', 'oma', 'aom', 'ao']
['ebsg', 'ebsg', 'bsg', 'ebs', 'eb']
['ebsw', 'ebsw', 'bsw', 'ebs', 'eb']
['achs', 'acohs', 'chs', 'ach', 'ac']
['osuati', 'osuati', 'suati', 'osuat', 'osua']
['atictcm', 'actcm', 'tictcm', 'atictc', 'atict']
['iar', 'ioar', 'ar', 'ia']
['avtc', 'avtc', 'vtc', 'avt', 'av']
['awc', 'awc', 'wc', 'aw']
['iauprb', 'iauoprb', 'auprb', 'iaupr', 'iaup']
['bu', 'bu']
['bc', 'bc']
['cab', 'caob', 'ab', 'ca']
['cahd', 'caohd', 'ahd', 'cah', 'ca']
['bu', 'bu']
['bcbhsrt', 'bcbhsort', 'cbhsrt', '

Here's what our dataframes currently look like.

In [9]:
accred_df.head()

Unnamed: 0,Institution,URL,stripped_url,keywords,potential_acronyms,potential_abbreviations
0,American Commercial College of Texas,acc-careers.com,acccareers,"[american, commercial, texas]","[acct, accot, cct, acc, ac]","[ame, com, tex]"
1,American Commercial College - Abilene,acc-careers.com,acccareers,"[american, commercial, abilene]","[acca, acca, cca, acc, ac]","[ame, com, abi]"
2,Anson College of Cosmetology,accschool.peedeeworld.net,accschool.peedeeworld,"[anson, cosmetology]","[acc, acoc, cc, ac]","[ans, cos]"
3,American College of Medical Technology,acmt.ac/,acmt,"[american, medical, technology]","[acmt, acomt, cmt, acm, ac]","[ame, med, tec]"
4,Aviation and Electronic Schools of America - C...,aesa.com,aesa,"[aviation, electronic, america, colfax]","[aesac, aaesoac, esac, aesa, aes]","[avi, ele, ame, col]"


Now let's make the actual predictive variables that check whether any of the acronyms, keywords or abbreviations show up in the supposed URL.

In [10]:
## each function checks whether any candidate substring from one list appears in the stripped URL.
def check_any_keyword_in_url(x):
    return any(kw in x['stripped_url'] for kw in x['keywords'])

def check_any_acronym_in_url(x):
    return any(ac in x['stripped_url'] for ac in x['potential_acronyms'])

def check_any_abbrev_in_url(x):
    return any(ab in x['stripped_url'] for ab in x['potential_abbreviations'])

funcs = [check_any_keyword_in_url,check_any_acronym_in_url,check_any_abbrev_in_url]
cols = ['keyword_in_url','acronym_in_url','abbrev_in_url']

In [11]:
## create each of the columns above in each dataframe
for df in dfs:
    for col, fun in zip(cols,funcs):
        df[col] = df.apply(fun, axis=1)

accred_df.head()

Unnamed: 0,Institution,URL,stripped_url,keywords,potential_acronyms,potential_abbreviations,keyword_in_url,acronym_in_url,abbrev_in_url
0,American Commercial College of Texas,acc-careers.com,acccareers,"[american, commercial, texas]","[acct, accot, cct, acc, ac]","[ame, com, tex]",False,True,False
1,American Commercial College - Abilene,acc-careers.com,acccareers,"[american, commercial, abilene]","[acca, acca, cca, acc, ac]","[ame, com, abi]",False,True,False
2,Anson College of Cosmetology,accschool.peedeeworld.net,accschool.peedeeworld,"[anson, cosmetology]","[acc, acoc, cc, ac]","[ans, cos]",False,True,False
3,American College of Medical Technology,acmt.ac/,acmt,"[american, medical, technology]","[acmt, acomt, cmt, acm, ac]","[ame, med, tec]",False,True,False
4,Aviation and Electronic Schools of America - C...,aesa.com,aesa,"[aviation, electronic, america, colfax]","[aesac, aaesoac, esac, aesa, aes]","[avi, ele, ame, col]",False,True,False
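
Because the check is a plain substring test, very short candidates can match almost anywhere in a URL. The sketch below shows this on three rows from the data and tries one possible mitigation, a minimum candidate length. The `min_len` parameter is a hypothetical addition that is not part of the notebook, and its effect on the overall match rates has not been measured.

```python
def any_candidate_in_url(candidates, stripped_url, min_len=1):
    """True if any candidate with at least min_len characters is a substring of the URL."""
    return any(c in stripped_url for c in candidates if len(c) >= min_len)

rows = [
    # (potential acronyms, stripped URL) copied from rows shown in this notebook
    (['acct', 'accot', 'cct', 'acc', 'ac'], 'acccareers'),   # genuine match via 'acc'
    (['gts', 'tgts', 'ts', 'gt'], 'remingtoncollege'),       # spurious match via 'gt'
    (['sa', 'sa'], 'sandiegojobcorps'),                      # spurious match via 'sa'
]

default_hits = [any_candidate_in_url(c, url) for c, url in rows]
strict_hits = [any_candidate_in_url(c, url, min_len=3) for c, url in rows]
print('min_len=1:', default_hits)
print('min_len=3:', strict_hits)
```

On these three rows a threshold of three characters removes both spurious matches while keeping the genuine one, but it would also discard any legitimate two-letter match, so the trade-off would need to be checked against the full accredited database.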


Let's get a sense of how often each of these matches occurs. These variables are clearly not mutually exclusive, as shown below. By construction, any keyword match is also an abbreviation match, because each abbreviation is the first three letters of a keyword, so abbreviations only add a marginal number of matches beyond keywords (correlation 0.92). This is problematic for logistic regression, where strongly correlated predictors make the fitted coefficients unstable. Meanwhile, a URL is unlikely to contain both an acronym and a keyword/abbreviation (correlation of about -0.57).

In [12]:
print(accred_df['keyword_in_url'].value_counts(normalize = True))
print(accred_df['acronym_in_url'].value_counts(normalize = True))
print(accred_df['abbrev_in_url'].value_counts(normalize = True))

accred_df[['keyword_in_url','acronym_in_url','abbrev_in_url']].corr()

True     0.558522
False    0.441478
Name: keyword_in_url, dtype: float64
False    0.578827
True     0.421173
Name: acronym_in_url, dtype: float64
True     0.599133
False    0.400867
Name: abbrev_in_url, dtype: float64


Unnamed: 0,keyword_in_url,acronym_in_url,abbrev_in_url
keyword_in_url,1.0,-0.572329,0.920034
acronym_in_url,-0.572329,1.0,-0.579878
abbrev_in_url,0.920034,-0.579878,1.0


Now let's define a variable "matched_url" that flags whether any of our three variables matches the URL. Right away we can see that the three variables together hit 90.8% of institutions in the database of accredited centers of learning, in which almost all URLs are correct. Clearly, we already have good predictive power.

In [13]:
names = ['Accredited:','Training:','Test:']

for df, name in zip(dfs,names):
    df['matched_url'] = df['keyword_in_url'] | df['acronym_in_url'] | df['abbrev_in_url']
    print(name,'\n',df['matched_url'].value_counts(normalize = True))

Accredited: 
 True     0.908282
False    0.091718
Name: matched_url, dtype: float64
Training: 
 True     0.54
False    0.46
Name: matched_url, dtype: float64
Test: 
 True     0.548469
False    0.451531
Name: matched_url, dtype: float64


A browse through the list of URLs that we failed to match below suggests that many of them would be pretty much impossible to guess without human triage.

In [14]:
accred_df[accred_df['matched_url'] == False]

Unnamed: 0,Institution,URL,stripped_url,keywords,potential_acronyms,potential_abbreviations,keyword_in_url,acronym_in_url,abbrev_in_url,matched_url
5,American University of Health Sciences,aihs.edu,aihs,"[american, health, sciences]","[auhs, auohs, uhs, auh, au]","[ame, hea, sci]",False,False,False,False
7,Adult Vocational Services,akronschools.com,akronschools,"[adult, vocational, services]","[avs, avs, vs, av]","[adu, voc, ser]",False,False,False,False
9,Fortis Institute - Scranton,alliedteched.edu,alliedteched,"[fortis, scranton]","[fis, fis, is, fi]","[for, scr]",False,False,False,False
27,Career Academy of Hair Design,beautyschool.edu,beautyschool,"[career, academy, hair, design]","[cahd, caohd, ahd, cah, ca]","[car, aca, hai, des]",False,False,False,False
31,Ohio Center for Broadcasting - Colorado Campus,beonair.com,beonair,"[ohio, center, for, broadcasting, colorado, ca...","[ocbcc, ocfbcc, cbcc, ocbc, ocb]","[ohi, cen, for, bro, col, cam]",False,False,False,False
34,Ohio State School of Cosmetology,blottsalonschools.com,blottsalonschools,"[ohio, state, cosmetology]","[ossc, ossoc, ssc, oss, os]","[ohi, sta, cos]",False,False,False,False
36,Ben Franklin Career & Technical Education Center,boe.kana.k12.wv.us/,boe.kana.k12.wv,"[ben, franklin, career, &, technical, educatio...","[bfctec, bfc&tec, fctec, bfcte, bfct]","[ben, fra, car, &, tec, edu, cen]",False,False,False,False
42,Marinello School of Beauty,brioacademy.com,brioacademy,"[marinello, beauty]","[msb, msob, sb, ms]","[mar, bea]",False,False,False,False
44,Everest College - Los Angeles,bryman-college.com,brymancollege,"[everest, los, angeles]","[ecla, ecla, cla, ecl, ec]","[eve, los, ang]",False,False,False,False
45,Everest College - Torrance,bryman-college.com,brymancollege,"[everest, torrance]","[ect, ect, ct, ec]","[eve, tor]",False,False,False,False


matched_url is effectively our first predictor. By looking at the training set and comparing with isWrong, we can gauge how good our variables are at predicting - in this case, we correctly classify 96 out of 100 URLs. I also show how well each individual variable predicts whether a URL is wrong. Given the results below, one would guess that generating a list of abbreviations is largely superfluous, especially given the high correlation with keywords.

In [15]:
sample_df['correct_guess'] = sample_df['isWrong'] == ~sample_df['matched_url']
print('All 3:',sample_df['correct_guess'].value_counts())

correct_guess_keyword = sample_df['isWrong'] == ~sample_df['keyword_in_url']
print('\nKeywords only:',correct_guess_keyword.value_counts())

correct_guess_acronym = sample_df['isWrong'] == ~sample_df['acronym_in_url']
print('\nAcronyms only:',correct_guess_acronym.value_counts())

correct_guess_abbrev = sample_df['isWrong'] == ~sample_df['abbrev_in_url']
print('\nAbbreviations only:',correct_guess_abbrev.value_counts())

All 3: True     96
False     4
Name: correct_guess, dtype: int64

Keywords only: True     82
False    18
dtype: int64

Acronyms only: True     68
False    32
dtype: int64

Abbreviations only: True     83
False    17
dtype: int64


However, adding abbreviations does seem to catch additional URLs in the accredited database, raising the share of matched URLs from 88.5% to 90.8%, as shown below. Therefore, I include it in my final model.

In [16]:
accred_df['keyword_plus_acronym'] = accred_df['acronym_in_url'] | accred_df['keyword_in_url']

print('All 3:',accred_df['matched_url'].value_counts(normalize = True))
print('\nKeywords and Acronyms only:',accred_df['keyword_plus_acronym'].value_counts(normalize = True))
print('\nKeywords only:',accred_df['keyword_in_url'].value_counts(normalize = True))
print('\nAcronyms only:',accred_df['acronym_in_url'].value_counts(normalize = True))
print('\nAbbreviations only:',accred_df['abbrev_in_url'].value_counts(normalize = True))

All 3: True     0.908282
False    0.091718
Name: matched_url, dtype: float64

Keywords and Acronyms only: True     0.884782
False    0.115218
Name: keyword_plus_acronym, dtype: float64

Keywords only: True     0.558522
False    0.441478
Name: keyword_in_url, dtype: float64

Acronyms only: False    0.578827
True     0.421173
Name: acronym_in_url, dtype: float64

Abbreviations only: True     0.599133
False    0.400867
Name: abbrev_in_url, dtype: float64


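A look below at the cases we misclassified in the training set shows the first evidence of false positives - our overzealous generation of acronyms caused a false match for Sunstate Academy ('sa' appears in sandiegojobcorps) and The General Theological Seminary ('gt' appears in remingtoncollege), while the use of abbreviations caused a false match between Newbridge College and newpaltz.edu. The fourth misclassification is a false negative: New Mexico Institute of Mining and Technology uses nmt.edu, but 'nmt' is not among the generated acronyms because the capitalized 'Institute' and 'Mining' contribute extra letters.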

In [17]:
sample_df[sample_df['correct_guess'] == False]

Unnamed: 0,Institution,URL,isWrong,stripped_url,keywords,potential_acronyms,potential_abbreviations,keyword_in_url,acronym_in_url,abbrev_in_url,matched_url,correct_guess
39,Newbridge College - El Cajon,www.newpaltz.edu,1,newpaltz,"[newbridge, el, cajon]","[ncec, ncec, cec, nce, nc]","[new, el, caj]",False,False,True,True,False
42,New Mexico Institute of Mining and Technology,www.nmt.edu,0,nmt,"[new, mexico, mining, technology]","[nmimt, nmiomat, mimt, nmim, nmi]","[new, mex, min, tec]",False,False,False,False,False
58,The General Theological Seminary,www.remingtoncollege.edu,1,remingtoncollege,"[general, theological, seminary]","[gts, tgts, ts, gt]","[gen, the, sem]",False,True,False,True,False
68,Sunstate Academy,www.sandiegojobcorps.org,1,sandiegojobcorps,"[sunstate, academy]","[sa, sa]","[sun, aca]",False,True,False,True,False
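
The four errors split into three false positives and one false negative. One way to tabulate this is `pd.crosstab`. The sketch below does not rerun the notebook: it rebuilds a stand-in dataframe from the counts implied by the outputs above (54 matches out of 100, 96 correct guesses, and the four misclassified rows), assuming isWrong only takes the values 0 and 1.

```python
import pandas as pd

# Counts reconstructed from the outputs above: (matched_url, isWrong) -> number of rows.
counts = {(True, 0): 51, (True, 1): 3, (False, 0): 1, (False, 1): 45}
toy_df = pd.DataFrame(
    [{'matched_url': m, 'isWrong': w} for (m, w), n in counts.items() for _ in range(n)]
)

# Readable labels avoid indexing the table with boolean keys.
confusion = pd.crosstab(
    toy_df['matched_url'].map({True: 'match', False: 'no match'}),
    toy_df['isWrong'].map({0: 'correct URL', 1: 'wrong URL'}),
)
print(confusion)

false_positives = confusion.loc['match', 'wrong URL']       # matched, but URL is wrong
false_negatives = confusion.loc['no match', 'correct URL']  # unmatched, but URL is right
print('false positives:', false_positives, '| false negatives:', false_negatives)
```

On the real data the same table could be produced directly with `pd.crosstab(sample_df['matched_url'], sample_df['isWrong'])`.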


In conclusion, I lock in matched_url (or rather, its negation) as my final predictor. I've also saved my work as pickle files.

In [18]:
for df in dfs:
    df['Prediction'] = ~df['matched_url']
    
accred_df.to_pickle('two_accred.pkl')
sample_df.to_pickle('two_sample.pkl')
test_df.to_pickle('two_test.pkl')

## Final thoughts

In the end, I chose not to resort to more complex machine learning methods because I didn't feel they were necessary, and the variables I had developed were ill-suited to methods such as logistic regression because they are strongly correlated with one another. The training set is also small enough that I didn't feel I could learn much about how to tune my model (for instance, how aggressively to generate potential acronyms). I also did not generate a probability of accuracy as requested in part b. However, my predictions are strongly bimodal: if there is a match with the URL, then the URL is very likely to be correct, and if there is no match, it is very likely to be wrong.
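
One simple way to attach a probability to each prediction, which I did not pursue in this notebook, would be to use the empirical error rate within each group of the training set. The sketch below rebuilds a stand-in for the training set from the counts implied by the outputs above (assuming isWrong is 0 or 1) and uses a made-up `new_matches` series to show how the rates would be applied. Since the variables were designed while looking at these same 100 rows, the resulting rates are probably optimistic.

```python
import numpy as np
import pandas as pd

# Counts reconstructed from the training-set outputs: (matched_url, isWrong) -> rows.
counts = {(True, 0): 51, (True, 1): 3, (False, 0): 1, (False, 1): 45}
toy_df = pd.DataFrame(
    [{'matched_url': m, 'isWrong': w} for (m, w), n in counts.items() for _ in range(n)]
)

# Empirical probability that a URL is wrong, conditional on the match flag.
p_wrong_if_matched = toy_df.loc[toy_df['matched_url'], 'isWrong'].mean()      # 3 / 54
p_wrong_if_unmatched = toy_df.loc[~toy_df['matched_url'], 'isWrong'].mean()   # 45 / 46

# Hypothetical matched_url values for three new rows.
new_matches = pd.Series([True, False, True])
prob_wrong = np.where(new_matches, p_wrong_if_matched, p_wrong_if_unmatched)

print(f"P(wrong | match)    = {p_wrong_if_matched:.3f}")
print(f"P(wrong | no match) = {p_wrong_if_unmatched:.3f}")
print(prob_wrong)
```

This yields only two distinct probability values, which reflects the bimodal nature of the predictor described above.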

Thank you for presenting this problem - I had not worked much with natural language previously and enjoyed the challenge.