The requests library has many functions that can help us; $requests.head()$ and $requests.get()$ seem the most useful. I will illustrate below with a bad tweet:

In [2]:
# Using 'https://t.co/I5HajUrJ61'
import requests as rq
r = rq.head('https://t.co/I5HajUrJ61')  # HEAD fetches only the headers and does not follow redirects by default
print(r)

<Response [301]>


Using head, we find our response code: 301. From Wikipedia:

The HTTP response status code 301: Moved Permanently is used for permanent URL redirection, meaning current links or records using the URL that the response is received for should be updated.

Using the 301 code, we can computationally identify when a link is shortened.
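
To see what the 301 check relies on without touching t.co, here is a minimal offline sketch. It is not part of the original analysis: it starts a toy shortener on localhost with the standard library (http.server) and queries it with http.client instead of requests. The handler class, the helper names, and the target URL are all made up for illustration. The point is that a redirect response carries its destination in the Location header.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

TARGET = "https://example.com/full/article"  # hypothetical long URL


class ShortenerHandler(BaseHTTPRequestHandler):
    """Toy URL shortener: every path answers 301 with a Location header."""

    def do_HEAD(self):
        self.send_response(301)
        self.send_header("Location", TARGET)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def head_status_and_location(host, port, path):
    """Send one HEAD request; http.client never follows redirects."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def run_demo():
    """Start the toy shortener, query it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), ShortenerHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return head_status_and_location("127.0.0.1", server.server_port, "/abc123")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, location = run_demo()
    print(status, location)
```

With requests, the same information should be available as r.status_code and r.headers['Location'] on the response returned by head.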

Furthermore, to find the unshortened URL we have:

In [5]:
unshortenedurl = rq.get('https://t.co/I5HajUrJ61')
print(unshortenedurl)
unshortenedurl.url

<Response [200]>


'https://twitter.com/i/web/status/1188083344724185088'

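The HTTP code 200 means that the request was successful; that is, the new unshortened URL points to a valid location. By default, get follows the redirect chain, so the response we see is the one from the final destination.
Lastly, the .url attribute holds the expanded URL.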


Branden's idea: create something that checks for the 301 condition and, if the code is 301, replaces the shortened URL with the full one.

To accomplish this, we need only the numeric code, which the response exposes as its status_code attribute (r.status_code in the code below).

In [3]:
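def unshorten_url(url):
    r = rq.head(url)  # HEAD request: status code and headers only
    if r.status_code == 301:
        r = rq.get(url)  # follow the redirect to reach the full URL
    return r.url  # final URL, or the original one if there was no 301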

In [4]:
unshorten_url('https://t.co/I5HajUrJ61')

'https://twitter.com/i/web/status/1188083344724185088'

While this works when the code is 301, there are other codes that can indicate a "shortened" URL, such as 302 and 300. Further research shows that the response codes follow a scheme. From Wikipedia:

1xx informational response – the request was received, continuing process

2xx successful – the request was successfully received, understood and accepted

3xx redirection – further action needs to be taken in order to complete the request

4xx client error – the request contains bad syntax or cannot be fulfilled

5xx server error – the server failed to fulfill an apparently valid request
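
The classes listed above are determined by the leading digit of the code, so integer division by 100 is enough to tell them apart. A minimal sketch follows; the function names status_class and is_redirect are made up for illustration and do not come from requests.

```python
def status_class(code):
    """Map an HTTP status code to the name of its class (1xx to 5xx)."""
    names = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # 301 // 100 == 3, 404 // 100 == 4, and so on
    return names.get(code // 100, "unknown")


def is_redirect(code):
    """True for any 3xx code."""
    return 300 <= code < 400


if __name__ == "__main__":
    for code in (200, 301, 302, 404, 503):
        print(code, status_class(code), is_redirect(code))
```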

Thus any code that begins with a 3 signals a redirect, which we treat as a sign of a shortened URL. To account for all of these codes:

In [9]:
import re  # regular expression library

def unshorten_url(text):
    # Match the first http(s) URL made of letters, digits, dots and slashes (enough for t.co links)
    pattern = re.compile(r"https?://[.A-Za-z0-9/]+")
    match = pattern.search(text)
    if match is None:
        return None  # no URL found in the text
    url = match.group(0)
    r = rq.head(url)
    if r.status_code // 100 == 3:  # integer division: any 3xx code gives 3
        r = rq.get(url)  # follow the redirects to the final page
    return r.url  # final URL, or the matched URL if there was no redirect

# To illustrate:
unshorten_url('https://t.co/I5HajUrJ61')

'https://twitter.com/i/web/status/1188083344724185088'

Regular expressions allow us to pull the URL out of any string. To learn how they work, I used this YouTube video:

https://www.youtube.com/watch?v=K8L6KVGG-7o
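
As a small illustration of the idea, the sketch below pulls every URL out of a piece of text. It is only one possible pattern, not the one used in unshorten_url: it takes everything from http:// or https:// up to the next whitespace and then trims trailing punctuation. The names URL_PATTERN and extract_urls, and the sample sentence, are made up for this example.

```python
import re

# Anything from http:// or https:// up to the next whitespace character
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text):
    """Return every URL found in text, with trailing punctuation trimmed."""
    return [m.rstrip(".,;:!?)\"'") for m in URL_PATTERN.findall(text)]


if __name__ == "__main__":
    sample = "Order here https://t.co/I5HajUrJ61 or read https://example.com/a?b=1&c=2."
    print(extract_urls(sample))
```

Unlike the character class used in unshorten_url, this pattern also keeps characters such as ?, = and & that appear in query strings.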

Using this definition in Parker's code:


In [11]:
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction

df = pd.read_csv(r"C:\Users\JungleBook\Desktop\DS Club\OpioidTwitter-master\data\opioid_tweets.csv")



with open(r"C:\Users\JungleBook\Desktop\DS Club\OpioidTwitter-master\bad_tweets.txt", "r") as f:
    ids = [i.strip() for i in f.read().split(",")] 

for item in df[df["id"].isin(ids)].iterrows():
    print()
    print("ID: {}".format(item[1]["id"]))
    print()
    # item[1]["content"] holds the full text of the tweet, not just the URL,
    # so unshorten_url has to pull the URL out of the text first
    print(unshorten_url(item[1]["content"]))
    print()
    print("User: {}".format(item[1]["user_name"]))
    print("_____________________________________________________________________________________________")



ID: 17942

https://www.medschat.com/Discuss/Fill-Oxycodone-Prescription-222724.htm?utm_source=twitter&utm_medium=social&utm_campaign=social_media

User: medschat
_____________________________________________________________________________________________

ID: 18456

https://www.medschat.com/Discuss/Fill-Oxycodone-Prescription-222724.htm?utm_source=twitter&utm_medium=social&utm_campaign=social_media

User: medschat
_____________________________________________________________________________________________

ID: 6047

https://twitter.com/i/web/status/1181944617341726724

User: HealthyLifePha3
_____________________________________________________________________________________________

ID: 6048

https://twitter.com/i/web/status/1181944314722766848

User: HealthyLifePha3
_____________________________________________________________________________________________

ID: 80456

https://twitter.com/account/suspended

User: EAdderall
_________________________________________________________

Thus we have a way to unshorten a URL regardless of where it appears within a string.

Concerns: this might not work in every case, so we might need to add if statements for the other HTTP codes (e.g. 4xx, 2xx, etc.).
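
One way this concern could be handled is sketched below. It is a rough outline under assumptions, not a tested replacement for unshorten_url. The function name resolve_url, its return format, and the http parameter are invented here. The http parameter is any object with a requests-style head method, and it defaults to the requests module itself. The sketch asks head to follow redirects with allow_redirects=True, sets a timeout, and branches on the class of the final status code.

```python
def resolve_url(url, http=None, timeout=10):
    """Return (final_url, note); final_url is None when the link cannot be resolved."""
    if http is None:
        import requests as http  # the real library, imported lazily so a stub can be passed instead
    try:
        # allow_redirects=True makes HEAD follow the whole redirect chain
        r = http.head(url, allow_redirects=True, timeout=timeout)
    except Exception as exc:  # timeouts, DNS failures, etc.; deliberately broad for a sketch
        return None, "request failed: {}".format(type(exc).__name__)

    kind = r.status_code // 100
    if kind == 2:
        return r.url, "ok"
    if kind == 3:
        return r.url, "redirect not followed"
    if kind == 4:
        return None, "client error {}".format(r.status_code)
    if kind == 5:
        return None, "server error {}".format(r.status_code)
    return None, "unexpected status {}".format(r.status_code)
```

A call such as resolve_url('https://t.co/I5HajUrJ61') would then return a pair instead of a bare string, so the loop over the tweets could skip or log links that fail. Some servers reject HEAD requests, in which case falling back to get may be needed.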