# Setting up a Tinder Scraper

This notebook covers how to scrape Tinder through its web application using Python's `Selenium` library. My team and I developed the scraper as part of an unsupervised learning analysis of the numerical and text data from 15000 Tinder profiles. This notebook only covers the web scraping portion of that project.

Tinder is a rich source of social data, allowing retrieval of age, sexual orientation (inferred, more on this later), occupation, education, and favorite artists, among others. Also, it's just very interesting.

# Selenium and the Web Scraping Process

We use `Selenium` for web scraping since we need to interact directly with Tinder's web application to properly scrape data. In a sense, we're creating a bot that does the following actions in this exact sequence:
1. Swipe up - opens the rest of the profile for viewing
2. Retrieve all information - swiping up loads the HTML containing all the information we need to scrape
3. Swipe left - we don't want to match with anyone, since handling the occasional match notification would take extra lines of code
4. Wait - use the `sleep` function to make the script wait a few seconds before the next iteration. This makes it behave closer to a human in terms of interacting with the web app.
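
The waiting step can be sketched as a small helper that draws a random delay rather than a fixed one. The name `human_pause`, its default bounds, and the injectable `sleep_fn` parameter are assumptions made for illustration, not part of the original scraper.

```python
import random
from time import sleep


def human_pause(low=2.0, high=4.0, sleep_fn=sleep):
    """Sleep for a random duration between low and high seconds.

    Returns the chosen delay so the caller can log it.
    sleep_fn is a hypothetical hook that lets the helper be exercised
    without actually waiting.
    """
    delay = random.uniform(low, high)
    sleep_fn(delay)
    return delay
```

Calling `human_pause()` at the end of each iteration gives every swipe a slightly different delay.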

## You need to make a dummy Tinder account

To start swiping, you need to make an account on Tinder for the bot to use. Upon registration, you must declare if you are male or female, which cannot be altered after creation. However, you may change your own preference in the user settings at any time. 

Tinder only matches based on sex, so with a male account you can scrape `straight female` profiles by setting your profile to find only females. Likewise, changing the settings on the same account to find only males lets you retrieve `homosexual male` profiles. This is why orientation is inferred rather than observed, and the limitation is that we can never truly be sure whether someone is bisexual.

## Importing Libraries

In [15]:
from selenium import webdriver
from time import sleep
import re
import numpy as np
import pandas as pd
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup as soup
from selenium.webdriver.chrome.options import Options
import random
import os
from IPython.display import clear_output
import matplotlib.pyplot as plt

### Set up your proxy

You might want to do this in case Tinder upgrades its defenses and starts blocking the IPs of bots it detects. Note that the proxy below is just a placeholder and should be replaced with an actual one.

In [2]:
# setting proxy for scraping
os.environ['HTTP_PROXY'] = 'http://54.238.250.91:8080/'
os.environ['HTTPS_PROXY'] = 'http://54.238.250.91:8080/'
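
Whether a browser driven by `Selenium` honors these environment variables depends on the platform and browser, so it may be worth passing the proxy to the browser directly as well. The sketch below assumes Chrome, which accepts a `--proxy-server` command-line switch. The helper name `add_proxy` is introduced here for illustration, and the address is the same placeholder as above.

```python
def add_proxy(options, proxy_url):
    """Attach a proxy to a browser options object.

    `options` is any object with an add_argument() method, such as
    selenium.webdriver.chrome.options.Options. `proxy_url` is a
    placeholder like 'http://54.238.250.91:8080/'.
    """
    # Chrome reads the proxy address from the --proxy-server switch
    options.add_argument(f"--proxy-server={proxy_url.rstrip('/')}")
    return options


# Intended use with Selenium (assumes an `opts = Options()` object exists):
# add_proxy(opts, os.environ['HTTP_PROXY'])
```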

### Initialize driver object

You'll need a driver object to use `Selenium`. Search for `<Your Browser Name Here> Driver` and you should find several download options. Check your browser settings for the version number, then download the matching driver version. Place the driver file in the same folder as your notebook.

In [3]:
opts = Options()
# adjacent string literals are joined, so no stray indentation ends up in the user agent
opts.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 "
    "Safari/537.36"
)

driver = webdriver.Chrome('chromedriver', options=opts)
driver.get('https://tinder.com/')

### Login

At this point, switch to the browser window opened by `Selenium` and log in manually using your phone number. Handle popups manually. This process of course can be automated as well, but it's much easier to log in manually. Your browser window should be in the state where it can view and swipe through profiles.

## Swipe and Scrape Loop

The code below will work as long as you fulfill the conditions stated above. The window should start automatically swiping as you run the code. 

In [5]:
counter = 0
loop = 500

`counter` keeps track of how many profiles the script has scraped, and will update per iteration. `loop` tells the script how many profiles to scrape before stopping. 

You'll notice that the script is basically a series of `try`/`except` statements, since not all profiles display the same information fields. Every iteration initializes a dictionary with keys corresponding to each column of information. This dictionary is converted to a `pandas` DataFrame and appended to a CSV file as a single row.
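
Since the loop appends one row per profile, the file ends up without column names unless a header is written once. The sketch below shows one way to do that with the standard library's `csv` module instead of `pandas`. The function name `append_row` and the `FIELDS` constant are assumptions for illustration, with the field names taken from the dictionary keys used in the loop.

```python
import csv
import os

# Column order mirrors the keys of the per-profile dictionary
FIELDS = ['sex', 'work', 'age', 'educ', 'bio', 'ig', 'piclink',
          'picnumbers', 'spotify_artist_1', 'spotify_artist_2']


def append_row(path, row, fields=FIELDS):
    """Append one profile dict to a CSV file, writing the header only once."""
    # The header is needed only when the file is new or still empty
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        # Fields missing from `row` are written as empty strings
        writer = csv.DictWriter(f, fieldnames=fields, restval='')
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

With a header in place, the file can later be read back without the first profile being mistaken for column names.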

In [8]:
for i in range(loop):
    
    #swipe up
    driver.find_element_by_tag_name('body').send_keys(Keys.UP)
    
    sleep(5)
    print('pressed up, finished sleeping')
    params = {'sex': 'male', 'work': [], 'age': [], 'educ': [] ,'bio': [], 'ig': [], 'piclink': [], 'picnumbers':[],
             'spotify_artist_1' : [], 'spotify_artist_2':[]}
    
    #grab page html for the whole page
    try:
        page_html = soup(driver.page_source, "lxml") 
    except:
        print("Can't get html of page.")
        break
    
    try:
        # get age
        profilehead = driver.find_element_by_css_selector('div.profileCard__header__info')
        params['age'] = re.findall(r"\n(\d{2})", profilehead.text)[0]
    except:
        params['age'] = 'None'
        print("No Age")
        
    try:
        # finds the bio
        bio = page_html.select_one('div.profileCard__bio')
        params['bio'] = bio.text
    except:
        params['bio'] = 'None'
        print("No Bio")
        
    try:
        # work info
        work = page_html.select_one('svg[viewbox="-1 0 16 12"]').parent.parent.select_one("div:nth-of-type(2)")
        params['work'] = work.text
    except:
        params['work'] = 'None'
        print("No Work Info")
        
        
    try:
        # educ info
        educ = page_html.select_one('svg[viewbox="0 0 16 12"]').parent.parent.select_one("div:nth-of-type(2)")
        params['educ'] = educ.text
    except:
        params['educ'] = 'None'
        print("No Educ")
        
    try:
        # picture links
        pics = page_html.select_one("div.profileCard__slider__img").get("style")
        pics = re.findall(r'url\("([^\(\)]+)"\)', pics)[0]
        params['piclink'] = pics

    except:
        params['piclink'] = 'None'
        print("No pic - ERROR")
        break

    
    try:
        # number of pics
        picbuttons = page_html.select("a.profileCard__slider__backLink > div > div.CenterAlign > div[role='button']")
        params['picnumbers'] = len(picbuttons)
    except:
        params['picnumbers'] = None
        print("No pic numbers - ERROR")
        break
    
    try:
        params['spotify_artist_1'] = None
        params['ig'] = None
        params['spotify_artist_2'] = None
        extras = page_html.select("div.Fw($medium)")
        for item in extras:
            if item.text == "My Anthem":
                params['spotify_artist_1'] = item.parent.select_one("span:nth-of-type(2)").text
            elif re.search("Instagram", item.text):
                params['ig'] = re.findall(r"\d+", item.text)[0]
            elif item.text == "My Top Spotify Artists":
                params['spotify_artist_2'] = ",".join([i.text for i in item.parent.select("span.Fz($ms)")])
    except:
        print("Error occurring in extra info")

    pd.DataFrame(params, index = [0]).to_csv("0617 1030am.csv", index = False, header = False, mode = "a+")
    counter += 1
    print(counter)
    # swipe left    
    # raw string keeps the CSS escapes intact without invalid-escape warnings
    ex = driver.find_element_by_css_selector(r"#content > span > div > div.App__body.H\(100\%\).Pos\(r\).Z\(0\) > div > main > div.H\(100\%\) > div > div > div.profileCard.Pos\(r\).D\(f\).Ai\(c\).Fld\(c\).Expand--s.Mt\(a\) > div.Pos\(f\).W\(100\%\).B\(0\).Z\(1\).Pos\(a\)--ml.Bdrsbend\(8px\)--ml.Bdrsbstart\(8px\)--ml.Bg\(\$transparent-white-gradient\) > div > button.button.Lts\(\$ls-s\).Z\(0\).Cur\(p\).Tt\(u\).Bdrs\(50\%\).P\(0\).Fw\(\$semibold\).recsGamepad__button.D\(b\).Bgc\(\#fff\).Wc\(\$transform\).Start\(15px\).Scale\(1\.1\)\:h > span")
    print('button selected')
    ex.click()
    # wait a random 2 to 4 seconds so the timing looks less mechanical
    sleeptime = random.randint(2,4)
    sleep(sleeptime)
    if i % 3:
        clear_output()

pressed up, finished sleeping
No Bio
No Work Info
No Educ
3774
button selected


KeyboardInterrupt: 
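
The two regular expressions in the loop can be checked offline by pulling them into small functions. This is only a sketch: the function names are made up here, and the sample strings are assumptions about what the header text and the `style` attribute look like, not captured Tinder output.

```python
import re


def parse_age(header_text):
    """Return the first two-digit number that follows a newline, or None."""
    match = re.search(r"\n(\d{2})", header_text)
    return match.group(1) if match else None


def parse_pic_url(style):
    """Return the URL inside url("...") in a style attribute, or None."""
    match = re.search(r'url\("([^()]+)"\)', style or "")
    return match.group(1) if match else None


# Hypothetical inputs, for illustration only
print(parse_age("Alex\n24"))
print(parse_pic_url('background-image: url("https://example.com/a.webp");'))
```

Returning `None` instead of raising keeps the "field is missing" case explicit, which is the same situation the `try`/`except` blocks in the loop handle.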

# Sample output

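Below is an example of an unprocessed dataframe from scraping. Because the CSV is written without a header row, `pandas` treats the first scraped profile as the column names.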

In [3]:
pd.read_csv("testdata.csv")

Unnamed: 0,male,None,20,University Of The Philippines Diliman,"I'll be your fave human 5' 11""",768,https://images-ssl.gotinder.com/5bace5ce439244e7097793a4/640x640_75_932293f6-eeb2-4db8-a48c-ea7d589cc6d8.webp,14,Leon Bridges,"Taylor Swift,Post Malone,Britney Spears,Calvin Harris, Sam Smith, Jessie Reyez"
0,male,,18,Far Eastern University,,,https://images-ssl.gotinder.com/5cfd6934f1d901...,3,,
1,male,,18,,,,https://images-ssl.gotinder.com/5d04d2951e3fa9...,2,,
2,male,,25,,Looking for someone to chat with about Game of...,,https://images-ssl.gotinder.com/5d01e36c01ae8a...,3,Lizzo,
3,male,,19,CHS,Queen of my own little world👸 ✖️ Fan Girl💁,,https://images-ssl.gotinder.com/5d052cfa3f4ac5...,4,,
4,male,McDonald's Binan High Way,18,Lyceum of Alabang,Movie Marathon. Chill,,https://images-ssl.gotinder.com/5d03ded9ae9d0a...,0,,
5,male,,19,,🏳️‍🌈💃🏻,,https://images-ssl.gotinder.com/5d01ea3d53caf6...,3,,
6,male,,20,,I'm just bored most of the time so hmu if we m...,,https://images-ssl.gotinder.com/5d0337ced8c017...,0,,
7,male,,20,Polytechnic University of the Philippines,여러분 안녕하십니까! 🇵🇭💙🇰🇷\r\n\r\nHMU. Mabilis ako mag ...,,https://images-ssl.gotinder.com/5d055c8e0e38f7...,2,,
8,male,,32,Rizal Technological University,,,https://images-ssl.gotinder.com/5d0549a28b4a57...,3,,
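
A headerless file like the one the loop writes can be read with explicit column names, so the first profile stays a data row. A minimal sketch, assuming the file has the ten fields in the order of the `params` keys. The in-memory rows are invented placeholders, not scraped data.

```python
import io

import pandas as pd

COLUMNS = ['sex', 'work', 'age', 'educ', 'bio', 'ig', 'piclink',
           'picnumbers', 'spotify_artist_1', 'spotify_artist_2']

# Placeholder rows standing in for the scraper's headerless CSV output
raw = io.StringIO(
    'male,,20,Some University,Sample bio,768,https://example.com/a.webp,14,Artist A,"Artist B,Artist C"\n'
    'male,,18,,,,https://example.com/b.webp,2,,\n'
)

# header=None stops pandas from consuming the first profile as column names
df = pd.read_csv(raw, header=None, names=COLUMNS)
print(df[['age', 'educ', 'picnumbers']])
```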


# Acknowledgements

I'd like to thank my teammates, Ella Manasan, Marvin Belina, and Naman Punit, for being such a wonderful team in our second term in graduate school, and Prof. Christian Alis for his guidance in all our technical endeavors.