# Summary

## Objective

For this human-guided machine learning problem, the challenge was to settle on useful user-created variables that could help to predict whether a URL was correctly linked with an institution of higher learning, or whether it was an error. 

## Approach

I created three types of variables based on institution names to predict potential URLs: potential acronyms, keywords, and potential abbreviations. These variables were tested in the scratch notebooks two_scratch and two_scratch_2, with key results documented here. In each case, I generate a list of substrings derived from the institution's name that might plausibly appear in its URL, and then create variables that show whether any of them appear in the actual URL. These latter variables are used to predict whether a URL is correct.

## Results

Using all three types of variables, I am able to find a link to 91.3% of all institutions in the database of accredited centers of higher learning (where almost all URLs are correctly associated), and to correctly predict, for 96 of the 100 potential URLs in the training set, whether they really belong to the listed institution. Since part b further requests a probability as opposed to a simple Yes or No, I also applied a logistic regression model using my user-defined variables.


## How to use this notebook to test an arbitrary set of institution names and URLs

The easiest way is to load your data as a dataframe (e.g. `test_df = pd.read_csv('testFile.csv')`) and then add the new dataframe to the list `dfs` in the data-loading cell (`dfs = [accred_df, sample_df, test_df]`). The dataframe needs an `Institution` column and a `URL` column, since the code below reads both. All user-defined variables will automatically be added to this dataframe, as well as the eventual predictions from both the direct estimate and the logistic regression. I'm happy to write some other interface if that seems too complicated.

## Remaining Questions

I am curious about the risk of false positives that comes with generating larger lists of potential acronyms and abbreviations. In test notebook two_scratch_2, I found that adding possible abbreviations increased the percentage of predicted URLs in the accredited institution database from 89.0% to 91.3%. However, that database contains no incorrect matches, so this gain says nothing about how many false positives the extra abbreviations introduce.
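
One way to put a number on this risk would be to measure the false positive rate on a labeled set that does contain wrong URLs. The following is only a sketch: the `(predicted, actual)` pairs are invented toy values, and `false_positive_rate` is a hypothetical helper that does not exist elsewhere in this notebook.

```python
def false_positive_rate(pairs):
    """Share of truly incorrect URLs that were nevertheless predicted as matches.

    `pairs` is an iterable of (predicted_match, actually_correct) booleans.
    """
    pairs = list(pairs)
    negatives = [predicted for predicted, actual in pairs if not actual]
    if not negatives:
        # No incorrect URLs at all, as in the accredited database:
        # the rate is undefined rather than zero.
        return float('nan')
    return sum(negatives) / len(negatives)

# Invented toy labels, for illustration only.
toy_pairs = [(True, True), (True, False), (False, False), (True, True), (False, True)]
fpr = false_positive_rate(toy_pairs)
print(fpr)  # 0.5: one of the two incorrect URLs was predicted as a match
```

Comparing this rate before and after adding the abbreviation variables would show whether the extra matches come at the cost of more false positives.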

In [1]:
import numpy as np
import pandas as pd
import re
pd.set_option('display.max_rows', 200)  # full option name; the bare 'max_rows' pattern is ambiguous in recent pandas

First, load data into pandas.

In [2]:
accred_df = pd.read_csv('Accreditation_Data.csv')
sample_df = pd.read_csv('sampleTestFile.csv')
test_df = pd.read_csv('testFile.csv')

dfs = [accred_df, sample_df, test_df]

print("Length of Accredited database:", len(accred_df))
print("Length of Training database:", len(sample_df))
print("Length of Test database:", len(test_df))

Length of Accredited database: 4383
Length of Training database: 100
Length of Test database: 392


Let's start by treating the URLs - convert everything to lowercase and get rid of prefixes and suffixes which might cause a false match.

In [3]:
## convert all URLs to lowercase first.
for df in dfs:
    df['URL'] = df['URL'].str.lower()

## define a function to reduce URLs to most useful information.
def strip_url(url):
    
    ##remove prefixes and suffixes from URL
    stripped_url = url.replace('http://','').replace('https://','').replace('www.','')
    match = re.match(r"(.*)\.", stripped_url)  # raw-string regex; greedy match up to the last dot strips endings such as .edu
    
    if match: 
        print(match.groups()[0])
        return match.groups()[0]
    
    #some URLs in databases are malformed - following exception prevents code from breaking.
    else:
        print('nomatch:',stripped_url)
        return stripped_url
    
for df in dfs:
    df['stripped_url'] = df['URL'].apply(strip_url)

acc-careers
acc-careers
accschool.peedeeworld
acmt
aesa
aihs
akronbeautyschool
akronschools
alexacebeauty
alliedteched
americanbeautyacademy
americancollegeofhair
antiochcollege
antonellic
aoma
artisticbeautycolleges
artisticbeautycolleges
ashdowncollege
ati.ag.ohio-state
aticareertraining
audioschool
avtec.labor.state.ak
awc
bc.inter
beacon
bealcollege
beautycareers
beautyschool
bellevue
bellin
bellinghambeautyschool
beonair
berkeley.peralta
beta.nwiht
blottsalonschools
blue.ab
boe.kana.k12.wv
boe.mono.k12.wv
boe.putn.k12.wv
bramsonort
brc
brewstertech
brioacademy
brownsontechnicalschool
bryman-college
bryman-college
burlingtontech
californiacareerschool
canadacollege
capitol-college
carnegieinstitute
carouselbeauty
carouselbeauty
carouselbeauty
cccua
cci
cci
cci
ccsce
cdtschool
ceitraining
cempr
centralcareer
centralohio.dalecarnegie
century-school.bizhosting
cetcleveland
cfinstitute
charlesstuartschool
chase
nomatch: clank85462
classact1cosmetology
coastline.cccd
collegeofhairdesign
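
The behaviour of the stripping step can be spot-checked on a handful of URLs without printing every row. The sketch below uses a hypothetical print-free variant, `strip_url_quiet`, and made-up input URLs chosen to reproduce three of the stripped values listed above; the real database entries may differ in scheme or suffix.

```python
import re

def strip_url_quiet(url):
    """Hypothetical print-free variant of strip_url, for spot checks only."""
    # Lowercase, then drop the scheme and any "www." so they cannot cause false matches.
    stripped = url.lower().replace('http://', '').replace('https://', '').replace('www.', '')
    # Greedy match up to the last dot removes the final suffix such as ".edu".
    match = re.match(r"(.*)\.", stripped)
    # Malformed entries without any dot are returned unchanged.
    return match.group(1) if match else stripped

# Assumed example inputs, not taken from the data files.
examples = [
    'http://www.antiochcollege.edu',
    'https://ati.ag.ohio-state.edu',
    'clank85462',
]
results = [strip_url_quiet(u) for u in examples]
for url, stripped in zip(examples, results):
    print(url, '->', stripped)
```

Note that only the last dotted suffix is removed, so subdomains such as `ati.ag.` stay in the stripped value.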

Let's write a function to produce our first variable from an institution name: keywords. More specifically, these are the words in the institution name other than prepositions, conjunctions and generic terms such as College, School, University etc.

In [4]:
## this function does several things at once:
## 1) we now also split on hyphens. Webpages may be hyphenated, but this way we'll still pick them up instead of missing
## them if they drop the hyphen in URL.
## 2) apostrophes aren't allowed in URLs, so we delete all apostrophes.
## 3) any generic words like College, School etc. are omitted, because they are not specific enough to lead to a match.

def get_keywords(x):
    possible_words = re.split(r'\s|-', x) ## split on either space or hyphen w/regular expression
    omit_words = ['college','institute','school','schools','university','the','of',',','-','and']
    possible_words = [ word.lower() for word in possible_words ]  ## convert everything to lower case
    
    #following line does 2 things at once: 1) drops words contained above in omit_words; 2)removes apostrophes
    resultwords  = [word.replace('\'', '') for word in possible_words if word.lower() not in omit_words]
    resultwords = list(filter(None, resultwords)) # fastest way to remove any new empty strings
    return resultwords
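
To make the keyword step concrete, here is a minimal sketch of how the keywords might then be checked against a stripped URL. `keywords_sketch` is a compact restatement of `get_keywords`, `any_keyword_in_url` is a hypothetical helper, and the institution names and stripped URLs are made up for illustration.

```python
import re

OMIT_WORDS = {'college', 'institute', 'school', 'schools', 'university',
              'the', 'of', ',', '-', 'and'}

def keywords_sketch(name):
    """Compact restatement of get_keywords, for illustration only."""
    words = [w.lower() for w in re.split(r'\s|-', name)]         # split on whitespace or hyphen
    kept = [w.replace("'", '') for w in words if w not in OMIT_WORDS]
    return [w for w in kept if w]                                 # drop empty strings

def any_keyword_in_url(name, stripped_url):
    """Hypothetical helper: True if any keyword is a substring of the stripped URL."""
    return any(k in stripped_url for k in keywords_sketch(name))

kw = keywords_sketch("Saint Mary's College of Winston-Salem")
print(kw)  # ['saint', 'marys', 'winston', 'salem']

hit = any_keyword_in_url("Antioch College", 'antiochcollege')
miss = any_keyword_in_url("Antioch College", 'bellevue')
print(hit, miss)  # True False
```

Because generic words are omitted, the match for 'antiochcollege' comes from 'antioch' alone, not from 'college'.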

Here's our second variable: a list of potential acronyms based on the institution name. I originally started out with just the acronym formed from all the capital letters in the name, which I saved in each dataframe as the variable 'acronym'. However, I found that many school names consist of a core university name followed by a location specifier (e.g. California State University - East Bay). In the end, I wound up creating up to 5 possible acronyms per institution: 1) Every capital; 2) Drop first letter; 3) Drop last letter; 4) Drop last two letters (many cities are two words); and 5) Acronym from the first letter of every word, even conjunctions. The improvements from the last two were pretty marginal and number 4 may lead to more false positives, but I found repeated instances where including extra acronyms helped to find a correspondence with a URL.

In [8]:
## base acronym: every capital letter in the institution name, converted to lowercase.
## make_more_acronyms below builds further variants from it - will see if that
## increases probability of false match too much.
def acronym(s):
    s = s.replace('The ','') #specifically handles initial The (which tends to mess up acronyms)
    return "".join(c.lower() for c in s if c.isupper())

## let's actually cycle through each of our dataframes and make a potential acronym right away.
for df in dfs:
    df['acronym'] = df['Institution'].map(acronym)

In [12]:
def make_more_acronyms(X):
    
    ## one initial acronym try - the first letter of every word, even filler words like of, the etc.
    ## low success rate, but does happen sometimes and no other way of catching such cases
    acronym_allwords = "".join(word[0].lower() for word in X['Institution'].split())
    acronym_allwords = acronym_allwords.replace('-','')
    
    if len(X['acronym']) >= 3:  ## length limit: dropping a letter from a shorter acronym would leave a single letter, which risks false matches
        
        acs = [X['acronym'], X['acronym'][1:], X['acronym'][:-1]]  # full acronym, plus variants without the first or the last letter
        acs.append(acronym_allwords)  # first letter of every word, filler words included
        
        if len(X['acronym']) >= 4: 
            #test for possibility that last two letters correspond to a branch campus not included in url
            acs.append(X['acronym'][:-2])
        
        return acs
    
    else: ## we only reach this point if the institution name's original acronym has fewer than 3 letters.
        acs = [X['acronym'],acronym_allwords]
        return acs
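
As a worked example, the variants for the name mentioned above, California State University - East Bay, can be traced with a string-based restatement of the two functions. `acronym_sketch` and `acronym_variants` are illustrative names, and the stripped URL `csueastbay` is an assumption rather than a value read from the data.

```python
def acronym_sketch(name):
    """Lowercased capital letters of the name, ignoring 'The '."""
    name = name.replace('The ', '')
    return ''.join(c.lower() for c in name if c.isupper())

def acronym_variants(name):
    """String-based restatement of make_more_acronyms, for illustration only."""
    full = acronym_sketch(name)
    # First letter of every whitespace-separated word, with stray hyphens removed.
    all_words = ''.join(w[0].lower() for w in name.split()).replace('-', '')
    if len(full) < 3:
        return [full, all_words]
    variants = [full, full[1:], full[:-1], all_words]
    if len(full) >= 4:
        variants.append(full[:-2])  # drop a possible two-word campus suffix
    return variants

variants = acronym_variants('California State University - East Bay')
print(variants)  # ['csueb', 'sueb', 'csue', 'csueb', 'csu']

stripped_url = 'csueastbay'  # assumed stripped URL
matches = [v for v in variants if v in stripped_url]
print(matches)  # ['csue', 'csu']
```

Here the full acronym 'csueb' does not occur in the assumed URL, while the shortened variants do, which is the situation the extra acronyms are meant to catch.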

The third and final variable tests for the use of abbreviations in URLs. For instance, the University of Wisconsin - Madison uses wisc.edu as its domain name. Therefore, I simply take the list of keywords and turn each into a potential 3-letter abbreviation (here 'wis', which is a substring of 'wisc').

In [13]:
def make_abbreviations(keywords):
    ## use the first three letters of each keyword as a potential abbreviation used in url
    return [ s[:3] for s in keywords ]
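
The Wisconsin example can be traced by hand as follows. This is a sketch under stated assumptions: the keyword list is what `get_keywords` would be expected to return for that name, and 'madonna' is a made-up stripped URL used only to show how a short abbreviation can also match an unrelated institution.

```python
def abbreviations_sketch(keywords):
    """Same idea as make_abbreviations: first three letters of each keyword."""
    return [k[:3] for k in keywords]

keywords = ['wisconsin', 'madison']  # assumed keywords for "University of Wisconsin - Madison"
abbrevs = abbreviations_sketch(keywords)
print(abbrevs)  # ['wis', 'mad']

# Intended match: 'wis' is a substring of the stripped URL 'wisc'.
intended = any(a in 'wisc' for a in abbrevs)

# Possible false positive: 'mad' also occurs in an unrelated, made-up URL.
spurious = any(a in 'madonna' for a in abbrevs)
print(intended, spurious)  # True True
```

The second check illustrates why three-letter abbreviations may raise the false positive rate even as they increase coverage.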

Let's actually create these columns in each of our dataframes. I show the head of accred_df to show what they look like.

In [14]:
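for df in dfs:
    df['keywords'] = df['Institution'].apply(get_keywords)
    ## build the acronym lists row by row: returning variable-length lists from
    ## df.apply(..., axis=1) can make pandas try to broadcast them into columns, which raised
    ## "ValueError: could not broadcast input array from shape (4) into shape (5)".
    df['potential_acronyms'] = [make_more_acronyms(row) for _, row in df.iterrows()]
    df['potential_abbreviations'] = df['keywords'].apply(make_abbreviations)
    
accred_df.head()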

In [None]:
## list comprehension that cycles through all potential acronyms and checks if any are in url.
def check_any_acronym_extra(x):
    in_url = [ ac in x['stripped_url'] for ac in x['potential_acronyms'] ]  # column name as created above
    return(any(in_url))

In [None]:
for df in dfs:
    df['potential_acronym_in_url'] = df.apply(check_any_acronym_extra, axis=1)