<h2>Methodology</h2>

<p>1. Identify what we need to get the job done and declare our imports.</p>

<p>2. Read in the file.</p>

<p>3. Format the emails to make URLs.</p>

<p>4. Use the requests library to check URLs for 200 responses.</p>

<p>5. Add all results back to the dataframe and inspect for satisfactory matches.</p>

<p>6. If the websites appear to match the URLs at an acceptable rate, then <b>wahoo!</b> If not, we have more work to do.</p>

<h3>The libraries we'll be using:</h3>

```python 
import pandas as pd
import requests
```

<h3>2. Read in the file</h3>


<p>I'm guessing someone would supply a list of emails as a CSV. I prefer pandas because I like working with dataframes. Python's built-in csv reader and a few lists would work just as well, but this is quick and nice.</p>

```python 
email_list = pd.read_csv("./emails.csv")
```
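
<p>For comparison, here is a minimal sketch of the standard-library route mentioned above. It assumes the file has a header row with an <code>email</code> column (the column name used in the later steps); the helper name <code>read_emails</code> is hypothetical.</p>

```python
import csv
import io


def read_emails(handle):
    """Return the non-empty values of the 'email' column from an open CSV file."""
    reader = csv.DictReader(handle)
    emails = []
    for row in reader:
        # A short row yields None for the missing cell, so fall back to "".
        value = (row.get("email") or "").strip()
        if value:
            emails.append(value)
    return emails


# In-memory sample; for a real file, pass open("./emails.csv", newline="").
sample = io.StringIO("email\nann@example.org\nbob@example.com\n")
emails = read_emails(sample)
print(emails)
```

<p>This gives a plain list rather than a dataframe, so the later column operations would become list comprehensions.</p>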

<h3>3. Format the URLs.</h3>

<p> I can't think of an email that is so bizarre as to justify starting any more complex than just checking the literal email domains for websites. So I'll just split all my emails on the '@' and see what I get.</p>
```python 
# Keep everything after the "@" as the candidate domain.
email_list['domain'] = email_list['email'].map(lambda x: x.split("@")[1])
```
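
<p>If the list turns out to be messier than expected (blank cells, stray whitespace, entries with no '@'), a slightly more defensive split might look like the sketch below. The function name <code>extract_domain</code> and the rule that a domain must contain a dot are assumptions for illustration, not a full email validator.</p>

```python
def extract_domain(email):
    """Return the lowercased domain of an email address, or None if it has none."""
    if not isinstance(email, str):
        return None  # e.g. NaN from an empty CSV cell
    # rpartition splits on the last "@", so the domain is whatever follows it.
    local, at, domain = email.strip().rpartition("@")
    domain = domain.lower()
    if not at or not local or "." not in domain:
        return None
    return domain


print(extract_domain("Ann.Lee@Example.org"))  # example.org
print(extract_domain("not-an-email"))         # None
```

<p>It could be dropped in with <code>email_list['email'].map(extract_domain)</code>, leaving missing values where no domain could be found.</p>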

<h3>4. Use the requests library to check URLs for 200 responses.</h3>

<p>For this we'll define a function that takes a domain and returns the URL if it responds with a 200, or "NO MATCH" otherwise.</p>

```python 
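def check_responses(domain):
    # Build the candidate URL from the bare email domain.
    url_string = "http://www." + domain
    try:
        # A timeout keeps one unresponsive host from stalling the whole run.
        r = requests.get(url_string, timeout=10)
    except requests.exceptions.RequestException:
        # DNS failures, refused connections and timeouts all count as no match.
        return "NO MATCH"
    print(r.status_code)
    if r.status_code == 200:
        return url_string
    else:
        return "NO MATCH"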
```        


<h3>5. We'll create a new column to check our results.</h3>
```python
email_list["website_url"] = email_list["domain"].apply(check_responses)

# Share of domains that did not return a 200 (lower is better).
no_match_rate = (email_list["website_url"] == "NO MATCH").mean()
print(round(no_match_rate, 2))
```


<h2> Conclusions </h2>
<p>It's tough to say how close this would get us to a reliable list of URLs. What I would need to understand is the significance of the ask, and the value proposition and intended use of the data.</p>
<p> Is this just speeding up our outreach and will it only be used internally? Following the 80/20 rule, this may have gotten us close enough in half an hour to add some serious value in bolstering the business development team's efforts. </p>
<p>If this is going to be shipped to a client for their personal outreach (perhaps they were given a weak email list export from a partner's Salesforce account and need help filling in the blanks), then we have some serious QA to do. How long is the list? Can someone check it manually? That almost defeats the purpose of the script. So we'll probably need to get fancy.</p>


<h2>Next steps</h2>
<p>My concern is not necessarily the negatives. We could strip out any punctuation in the domain and run it again, replacing dashes with underscores, etc., which may improve our hit rate. What I do worry about is false positives and sending a client to a porn site or something.</p>
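
<p>As a rough sketch of that retry idea, the helper below produces alternate spellings of a domain to run through the checker again. It covers only the dash-stripping case, and the name <code>domain_variants</code> is hypothetical.</p>

```python
def domain_variants(domain):
    """Return the domain plus a hyphen-free spelling, without duplicates."""
    variants = [domain]
    stripped = domain.replace("-", "")
    if stripped != domain:
        variants.append(stripped)
    return variants


print(domain_variants("my-charity.org"))  # ['my-charity.org', 'mycharity.org']
print(domain_variants("example.org"))     # ['example.org']
```

<p>Each variant that comes back with a 200 is one more candidate to vet, so this raises the hit rate and the false-positive risk together.</p>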
<p>My next move would probably be to actually run some regex on those responses to see if any of the HTML can lead us to a confirmation.</p>
<p>Are these emails thematically linked in any way? If we can round up a short list of keywords into an array, we could parse the response content using Beautiful Soup and check for matches. If some number of matches were found on the page then we could call it safe. Again, it really depends on the use case and for whom we are grabbing this data.</p>
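
<p>A minimal sketch of that keyword check follows. To keep it dependency-free it uses the standard library's <code>html.parser</code> in place of Beautiful Soup; the keyword list, the <code>min_hits</code> threshold and the function names are all assumptions for illustration, and matching is a simple case-insensitive substring test.</p>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def keyword_hits(html, keywords):
    """Return the keywords found (as substrings, ignoring case) in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return [kw for kw in keywords if kw.lower() in text]


def looks_safe(html, keywords, min_hits=2):
    """Treat a page as a plausible match when at least min_hits keywords appear."""
    return len(keyword_hits(html, keywords)) >= min_hits


# Hypothetical page and keyword list for a non-profit themed email list.
page = (
    "<html><head><script>var donate = 1;</script></head>"
    "<body><h1>Our Mission</h1><p>Donate to support volunteers.</p></body></html>"
)
keywords = ["donate", "volunteer", "mission", "grant"]
print(keyword_hits(page, keywords))
print(looks_safe(page, keywords))
```

<p>In practice the <code>html</code> argument would be <code>r.text</code> from a successful response, and the threshold would need tuning against a hand-checked sample.</p>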

<p> Looking forward to discussing this and any other examples you have of the kinds of work you do, and what might be thrown at me. I'm very keen to learn how data science can impact the non-profit world.</p>