# Notebook description

#### What is this notebook for? 
This is the first notebook in a series of five. Its main purpose is to collect the data used in the rest of the project. 


The data collection is divided into multiple steps: 
1. Create a list of pages/accounts where I found bots' comments 
2. For each page, look at its last 50+ posts to collect the ID of each post 
3. Once done, loop through those thousands of IDs to collect the comments
4. Collect the public profile data of the users who commented



I collected the post IDs from multiple pages susceptible to having bots commenting on their posts. I selected those pages manually, making sure they are all targeted by bots. Then I loop through each post using its ID. On Instagram, each post URL has the form `instagram.com/p/{postid}`.
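
As a minimal sketch of the URL convention above, the two helpers below build a post URL from an ID and recover the ID from a relative link. The names `post_url` and `post_id_from_href` are hypothetical (they are not part of this notebook), and the link shape `/p/<id>/` is an assumption about how Instagram renders post links.

```python
from typing import Optional

BASE_URL = "https://www.instagram.com"


def post_url(post_id: str) -> str:
    """Build the public URL of a post from its ID."""
    return f"{BASE_URL}/p/{post_id}/"


def post_id_from_href(href: Optional[str]) -> Optional[str]:
    """Return the post ID from a relative link like '/p/<id>/', else None."""
    if not href:
        return None
    parts = [part for part in href.split("/") if part]
    # A post link is assumed to start with 'p' followed by the ID
    if len(parts) >= 2 and parts[0] == "p":
        return parts[1]
    return None


example_url = post_url("CY23Dx-hQzz")
example_id = post_id_from_href("/p/CY23Dx-hQzz/")
```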

## Script

## Import modules and UDFs

In [3]:
# Personal modules: functions I use often
import src.webscraping as mw
import src.useful as mu


# Better print 
from tqdm import tqdm

# To become Dr Strange 
import time 
from datetime import datetime

# Basic data 
import pandas as pd
import numpy as np

# Store data 
import sqlite3
import json
from flatten_json import flatten

# Move things around locally
import shutil

# Fetch instagram data 
from instaloader import Instaloader, Profile
from instaloader.exceptions import ProfileNotExistsException
import urllib
from splinter import Browser

import os

from coolname import generate_slug

### Setting notebook preferences

In [5]:
pd.set_option("display.max_columns", None)

In [4]:
# Creating SQL database to store all the data for this project
database = "data/main_database.sqlite"
con = sqlite3.connect(database)

## 1. Collect post ids

First, I list a number of Instagram accounts whose posts are targeted by bots. Then I loop through that list of pages and collect the post IDs of each page one by one.

In [2]:
# Listing pages targeted by bots
pages = ["nfl", "championsleague", "mercedesamgf1", "ESPN", "bleacherreport", "houseofhighlights", "nba", "worldstar", "grmdaily", "pubity", 
         "meme.ig", "brgridiron", "lakers", "ballislife", "nflonfox", "nflnetwork", "espnnfl", "cbssports", "thecheckdown"]
len(pages)

19

In [8]:
# Launch browser
browser = Browser('chrome')

post_per_page = 50
for page in pages:
    browser.visit(f"https://www.instagram.com/{page}/")
    postids = []
    

    while len(postids) < post_per_page:
        # Scrolling up and then down each time helps keep the page from getting stuck
        browser.execute_script("window.scrollTo(0, 0);")
        time.sleep(0.5)
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        
        # Instagram only includes the posts currently on screen in the HTML, so I need to scroll down slowly to collect each of them.
        browser.execute_script("window.scrollBy(0, 200);")

        # Change the browser to a beautiful soup object where I can get the posts id
        soup = mw.bsoup(browser)
        for element in soup.find_all("a"):
            link = element.get("href")
            # Post IDs are found in the href attribute of <a> tags; some <a> tags have no href
            if link and "/p/" in link:
                postids.append(link.replace("/p/", "").replace("/", ""))

        # Create a df with post_ids and save it in the db
        df_post_ids = pd.DataFrame(set(postids), columns=["post_id"])
        df_post_ids['page'] = page
        df_post_ids.to_sql("post_ids", con, if_exists="append", index=False)

        # Display a random countdown while waiting before the next scroll
        for i in list(range(np.random.randint(10, 30)))[::-1]:
            print(i, end="\r")
            time.sleep(1)
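
Because the loop above appends to `post_ids` on every pass of the `while` loop, the same ID can be written several times. One possible way to avoid duplicates at write time is a primary key combined with `INSERT OR IGNORE`. This is only a sketch against an in-memory database: the table `post_ids_demo`, the connection `con_demo`, and the helper `save_post_ids` are hypothetical names, not part of the notebook.

```python
import sqlite3


def save_post_ids(con, page, post_ids):
    """Insert (post_id, page) rows, silently skipping IDs already stored."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS post_ids_demo ("
        "post_id TEXT PRIMARY KEY, page TEXT)"
    )
    con.executemany(
        "INSERT OR IGNORE INTO post_ids_demo (post_id, page) VALUES (?, ?)",
        [(post_id, page) for post_id in post_ids],
    )
    con.commit()


con_demo = sqlite3.connect(":memory:")
save_post_ids(con_demo, "nba", ["A1", "B2"])
save_post_ids(con_demo, "nba", ["B2", "C3"])  # B2 is ignored the second time
n_rows = con_demo.execute("SELECT COUNT(*) FROM post_ids_demo").fetchone()[0]
```

With this design the scraping loop can save after every scroll without inflating the table.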

## 2. Scrape comments

In [22]:
# Query all distinct post_ids and their page
query = """
SELECT  
    DISTINCT
    post_id
    , page_name
FROM post_ids 
"""

# Loading post_ids from db
posts_ids_to_scrape = pd.read_sql_query(query, con)
posts_ids_to_scrape

Unnamed: 0,post_id,page_name
0,CY23Dx-hQzz,br_hoops
1,CYhycdhh__L,br_hoops
2,CXpjIIagKAx,br_hoops
3,CYxi2doBjWu,br_hoops
4,CYDVbgjhJEq,br_hoops
...,...,...
7826,CWIYygLtmsJ,mercedesamgf1
7827,CZhKJh_uJne,mercedesamgf1
7828,CWROwfUMZR4,mercedesamgf1
7829,CYWw8V4tRPD,mercedesamgf1


In [415]:
# Scraping and storing comments from post_ids:
# Shuffle the IDs and iterate over the values (iterating over a DataFrame yields its column names)
post_ids = posts_ids_to_scrape["post_id"].sample(frac=1).tolist()
for post_id in tqdm(post_ids):
    browser.visit(f"https://www.instagram.com/p/{post_id}/")
    
    # Wait for page to load and get its soup
    while True: 
        soup = mw.bsoup(browser)
        if soup.find("h2", class_="_a9zc") is not None: 
            break
        time.sleep(0.5)

    # Get days since posted
    post_posted_time = soup.find("time", class_="_a9ze _a9zf").get("datetime")
    now = datetime.now()
    days_diff = (now - pd.to_datetime(post_posted_time[:-1])).days

    # I only keep posts from the last month so the data isn't too old
    if days_diff > 31: 
        continue
    
    df_post_comments = pd.DataFrame(columns=["post_id", "page", "legend", "post_posted_time", "username", 
                                             "full_comment_data", "comment", "comment_posted_time", 
                                             "time_since_posted", "comments_likes", "replies",
                                             "time_now"])

    for comment_block in soup.find_all("ul", class_="_a9ym"):
        page = soup.find("h2", class_="_a9zc").text 
        legend = soup.find("div", class_="_a9zs").text 
        time_since_posted = comment_block.find("time", class_="_a9ze _a9zf").text 
        username = comment_block.find("h3", class_="_a9zc").text 
        comment_posted_time = comment_block.find("time", class_="_a9ze _a9zf").get("datetime")
        comments_likes = comment_block.find("button", class_="_a9ze").text 
        comment = comment_block.find("div", class_="_a9zs").text  
        replies_block = comment_block.find("li", class_="_a9yg")
        replies = replies_block.text if replies_block is not None else ""
        full_comment_data = comment_block.text
        time_now = datetime.now()

        # Add comment values to dataframe
        df_post_comments.loc[len(df_post_comments)] = [post_id, page, legend, post_posted_time, username, 
                                                       full_comment_data, comment, comment_posted_time, 
                                                       time_since_posted, comments_likes, replies,
                                                       time_now]

        
    df_post_comments.to_sql("comments", con, if_exists="append", index=False)
    time.sleep(np.random.randint(40, 60))
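
The cell above compares a naive local `datetime.now()` with a UTC timestamp whose trailing `Z` has been stripped. A timezone-aware variant could look like the sketch below. The helper name `days_since` is hypothetical, and the timestamp format (ISO 8601 ending in `Z`) is an assumption about the `datetime` attribute of the `<time>` tag.

```python
from datetime import datetime, timezone


def days_since(iso_timestamp, now=None):
    """Whole days elapsed since a UTC ISO 8601 timestamp ending in 'Z'."""
    # Replace the 'Z' suffix with an explicit UTC offset so the result is timezone-aware
    posted = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - posted).days


# Fixed reference time so the example is reproducible
reference = datetime(2022, 2, 1, 12, 0, tzinfo=timezone.utc)
age = days_since("2022-01-18T17:03:31.000Z", now=reference)
```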

### Create cooler name 

In [16]:
# Load the full comments table so that no column is lost when it is written back
df_post_comments = pd.read_sql_query('select * from comments', con)
all_usernames = df_post_comments['username'].unique()

# Get a cool name for each user and store the mapping in the database
cooler_names = {username: generate_slug(3) for username in all_usernames}
df_username_mapping = pd.DataFrame(cooler_names.items(), columns=['username', 'cooler_name'])
df_username_mapping.to_sql("username_mapping", con, index=False)


# Mapping usernames 
df_post_comments['username'] = df_post_comments['username'].map(cooler_names)
df_post_comments.to_sql("comments", con, if_exists="replace", index=False)
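
Randomly generated names are not guaranteed to be unique, so with tens of thousands of users two accounts could end up sharing one pseudonym. A minimal sketch of a collision check follows. `make_unique_names` and `demo_generator` are hypothetical names; in this notebook the generator passed in could be `lambda: generate_slug(3)`.

```python
import random


def make_unique_names(usernames, make_name, max_tries=100):
    """Map each username to a pseudonym that no other username received."""
    mapping, used = {}, set()
    for username in dict.fromkeys(usernames):  # de-duplicate, keep order
        for _ in range(max_tries):
            candidate = make_name()
            if candidate not in used:
                break
        else:
            raise RuntimeError("could not find an unused pseudonym")
        used.add(candidate)
        mapping[username] = candidate
    return mapping


# Stand-in generator for the demo (assumption: any zero-argument callable works)
rng = random.Random(0)
demo_generator = lambda: f"name-{rng.randint(0, 9999)}"
mapping = make_unique_names(["alice", "bob", "alice", "carol"], demo_generator)
```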

## 3. Collect user data from comments

In [7]:
# Instantiate Instaloader
L = Instaloader()

def fetch_user_data(username):
    '''Fetch an Instagram user's public data and save it as a JSON file.

    Parameter: 
        username (str): username of the user to collect the 
        data from.
    '''

    try:
        profile = Profile.from_username(L.context, username)
    except ProfileNotExistsException:
        return f"{username} does not exist anymore"
        
    data = profile.__dict__
    del data["_context"]
    json_object = json.dumps(data, indent = 2)   
    with open(f"data/users_json/{username}_user_profile_data.json", 'w') as file_object:  
        json.dump(json_object, file_object) 
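
Note that `fetch_user_data` serializes twice (`json.dumps`, then `json.dump`), which is why `convert_json` has to call `json.loads(json.load(...))`. The sketch below illustrates the difference with an in-memory buffer and made-up profile data. It is an illustration of the standard `json` module's behavior, not a change to the notebook's files.

```python
import io
import json

profile_data = {"username": "example_user", "followers": 42}  # made-up data

# Double encoding, as in fetch_user_data: the file holds a JSON *string*
buffer = io.StringIO()
json.dump(json.dumps(profile_data), buffer)
buffer.seek(0)
once = json.load(buffer)   # a str containing JSON text
twice = json.loads(once)   # the original dict

# Single encoding: one json.load is enough to get the dict back
buffer_single = io.StringIO()
json.dump(profile_data, buffer_single)
buffer_single.seek(0)
direct = json.load(buffer_single)
```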

In [29]:
def convert_json(username):
    # Open the json file (it was double-encoded when saved, hence load + loads)
    try:
        with open(f"data/users_json/{username}_user_profile_data.json") as json_file:
            data = json.loads(json.load(json_file))["_node"]
    except json.decoder.JSONDecodeError:
        return

    # Useless json keys
    ban = ['edge_felix_video_timeline', 'edge_owner_to_timeline_media', 'edge_saved_media', 
            'edge_media_collections', 'edge_related_profiles']

    # Defining basic keys: name, follower count, bio, etc.
    basic_keys = [key for key in data.keys() if key not in ban]
    basic_info = flatten({key:data[key] for key in basic_keys})
    df_current_user = pd.DataFrame([basic_info])

    # Getting the data for all posts that are contained in lists. 
    df_current_user["video_count"] = data['edge_felix_video_timeline']["count"]
    df_current_user["post_count"] = data['edge_owner_to_timeline_media']["count"]

    last_12_posts = dict()
    posts = data['edge_owner_to_timeline_media']["edges"]
    last_12_posts["username"] = data["username"]
    last_12_posts["video_views"] = [post["node"]["video_view_count"] if "video_view_count" in post["node"].keys() else np.nan for post in posts]
    last_12_posts["display_url"] = [post["node"]["display_url"] for post in posts]
    last_12_posts["thumbnail_src"] = [post["node"]["thumbnail_src"] for post in posts]
    last_12_posts["accessibility_caption"] = [post["node"]["accessibility_caption"] for post in posts]
    last_12_posts["is_video"] = [post["node"]["is_video"] for post in posts]
    last_12_posts["likes"] = [post["node"]["edge_liked_by"]["count"] for post in posts]
    last_12_posts["comments"] = [post["node"]["edge_media_to_comment"]["count"] for post in posts]
    last_12_posts["timestamp"] = [post["node"]["taken_at_timestamp"] for post in posts]
    df_last_12_posts = pd.DataFrame(last_12_posts)


    # Changing list type to str as sqlite3 doesn't accept this type
    list_features = ['bio_links', 'biography_with_entities_entities', 'edge_mutual_followed_by_edges', 'pronouns']
    for column in list_features: 
        if column in df_current_user.columns:
            df_current_user[column] = df_current_user[column].apply(lambda x:  "_LIST_SEPARATOR_".join(x))

    return df_current_user, df_last_12_posts

In [9]:
# Query usernames from comments that are not yet in the users table
query = """
SELECT  
    DISTINCT
    username
FROM comments 
WHERE 1=1
    AND username NOT IN (SELECT username FROM users)
"""

# Loading usernames to scrape from db
usernames_to_scrape = pd.read_sql_query(query, con)["username"]
print(f"Usernames to scrape: {len(usernames_to_scrape)}")


for index, username in enumerate(tqdm(usernames_to_scrape[:5])): 
    fetch_user_data(username)
    
    try:
        df_current_user, df_current_last_12_posts = convert_json(username)
    except (FileNotFoundError, TypeError): # No file if fetching failed; convert_json returns None if the file is unreadable
        continue

    # Not all users have the same number of columns returned and SQL needs same cols to use 'append'
    df_users = pd.read_sql_query('select * from users', con)
    df_users = pd.concat([df_users, df_current_user]).drop_duplicates()
    df_users.to_sql("users", con, if_exists="replace", index=False)
    df_current_last_12_posts.to_sql("last_12_posts", con, if_exists="append", index=False)

    # Profile pic
    for username, profile_pic_url in df_current_user[["username",  "profile_pic_url"]].values: 
        try:
            urllib.request.urlretrieve(profile_pic_url, f"data/photos/user_profile_pictures/{username}_pp_user_photo.png")
        except Exception as e:
            print(username, e, end='\r')

    # Last 12 posts
    df_current_last_12_posts = df_current_last_12_posts.reset_index()
    for username, display_url, index in df_current_last_12_posts[["username", "display_url", "index"]].values: 
        urllib.request.urlretrieve(display_url, f"data/photos/user_last_12_posts/{username}_{index}_user_photo.png") 



Usernames to scrape: 88608


100%|██████████| 5/5 [00:13<00:00,  2.68s/it]
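
The loop above rereads and rewrites the whole `users` table for every user because the columns differ from one profile to the next. With roughly 88,000 usernames to process, that cost grows with every iteration. An alternative, sketched here against an in-memory database, is to add missing columns with `ALTER TABLE` and then insert a single row. The helper `append_row`, the connection `con_demo`, and the table `users_demo` are hypothetical names.

```python
import sqlite3


def append_row(con, table, row):
    """Insert a dict as a row, adding any column the table lacks."""
    # Identifiers are interpolated into SQL, so only use trusted names here
    existing = {info[1] for info in con.execute(f'PRAGMA table_info("{table}")')}
    for column in row:
        if column not in existing:
            con.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}"')
    columns = ", ".join(f'"{c}"' for c in row)
    placeholders = ", ".join("?" for _ in row)
    con.execute(
        f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})',
        list(row.values()),
    )
    con.commit()


con_demo = sqlite3.connect(":memory:")
con_demo.execute('CREATE TABLE users_demo ("username" TEXT)')
append_row(con_demo, "users_demo", {"username": "a", "followers": 10})
append_row(con_demo, "users_demo", {"username": "b", "pronouns": "they"})
rows = con_demo.execute(
    "SELECT username, followers, pronouns FROM users_demo ORDER BY username"
).fetchall()
```

Columns that a given user does not have are simply left as `NULL`.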


### 3.4 Screenshot bio url landing page

In [11]:
# Query users having a link and remove those with NA
query = """
SELECT 
    DISTINCT 
    username
    , external_url 
FROM users 
WHERE external_url IS NOT NULL"""

df_user_urls = pd.read_sql_query(query, con)
df_user_urls

False

In [None]:
browser = Browser("chrome")
for username, external_url in tqdm(df_user_urls.values):
    try:
        browser.visit(external_url)
        browser.driver.save_screenshot(f"data/url_screenshot/{username}_external_url_screenshot.png")
        browser = mw.launch_driver("/Users/marclamy/Desktop/main file/code/igbot_final/chromedriver")
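    except Exception as e:
        # Skip URLs that fail to load, but keep a trace of the error
        print(username, e, end='\r')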

100%|██████████| 36/36 [02:28<00:00,  4.12s/it]
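
Bio links are free text, so some of them may lack a scheme or not be web links at all. A small pre-check before `browser.visit` could skip the clearly invalid ones. This is a sketch only: `normalize_url` is a hypothetical helper, and defaulting to HTTPS when no scheme is given is an assumption.

```python
from urllib.parse import urlparse


def normalize_url(raw_url):
    """Return a visitable http(s) URL, or None if it cannot be one."""
    if not raw_url or not raw_url.strip():
        return None
    url = raw_url.strip()
    if "://" not in url:
        url = "https://" + url  # assume HTTPS when no scheme is given
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return url


checked = [normalize_url(u) for u in ["example.com/shop", "  ", "ftp://host/file"]]
```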


## 4. Giving cooler names to the users 

In [19]:
from coolname import generate_slug

# Loading all dfs and creating a list of them
df_comments = pd.read_sql_query('select * from comments', con)
df_users = pd.read_sql_query('select * from users', con)
df_last_12_posts = pd.read_sql_query('select * from last_12_posts', con)
all_dfs = [df_comments, df_users, df_last_12_posts]

# List all usernames
all_usernames = [username for df in all_dfs for username in df['username']]
print(len(all_usernames))
all_usernames = list(set(all_usernames))
print(len(all_usernames))

584202
88697


In [13]:
pd.read_sql_query('select * from comments', con)['username']

0              carrot-mastodon-of-enthusiasm
1                   new-statuesque-binturong
2                       adorable-jade-beluga
3                  hysterical-glistening-bee
4                 uber-dragon-of-advertising
                         ...                
134460                  unique-gaur-of-karma
134461                   prudent-rare-dragon
134462              grinning-rugged-parakeet
134463                  imported-famous-tody
134464    important-partridge-of-performance
Name: username, Length: 134465, dtype: object

In [21]:
df_users['username'].head(10), df_comments['username'].head(10), df_last_12_posts['username'].head(10)

(0     crazy-antelope-of-atheism
 1       garrulous-violet-grouse
 2             daring-modest-owl
 3        tunneling-maroon-potoo
 4           merry-silver-grouse
 5        amphibian-gifted-mouse
 6         funky-original-pigeon
 7    delectable-hypnotic-mantis
 8    berserk-determined-gazelle
 9          natural-heavy-locust
 Name: username, dtype: object,
 0    invisible-fennec-of-teaching
 1      invisible-sparkling-avocet
 2         sociable-secret-buzzard
 3          quaint-elated-starling
 4      nostalgic-tactful-mosquito
 5     maize-axolotl-of-renovation
 6       delectable-gifted-unicorn
 7             crafty-caped-ermine
 8         festive-mindful-caracal
 9        adamant-rhino-of-courage
 Name: username, dtype: object,
 0    crazy-antelope-of-atheism
 1    crazy-antelope-of-atheism
 2    crazy-antelope-of-atheism
 3    crazy-antelope-of-atheism
 4    crazy-antelope-of-atheism
 5    crazy-antelope-of-atheism
 6    crazy-antelope-of-atheism
 7    crazy-antelope-of-atheism


In [22]:
df_last_12_posts

Unnamed: 0,username,video_views,display_url,thumbnail_src,accessibility_caption,is_video,likes,comments,timestamp,cool_username
0,crazy-antelope-of-atheism,,https://scontent-lga3-2.cdninstagram.com/v/t51...,https://scontent-lga3-2.cdninstagram.com/v/t51...,"Photo shared by 🗣J on November 22, 2021 taggin...",0,34.0,3.0,1.637630e+09,crazy-antelope-of-atheism
1,crazy-antelope-of-atheism,,https://scontent-lga3-2.cdninstagram.com/v/t51...,https://scontent-lga3-2.cdninstagram.com/v/t51...,"Photo by 🗣J on June 17, 2021. May be a closeup...",0,58.0,7.0,1.623961e+09,crazy-antelope-of-atheism
2,crazy-antelope-of-atheism,,https://scontent-lga3-2.cdninstagram.com/v/t51...,https://scontent-lga3-2.cdninstagram.com/v/t51...,"Photo by 🗣J on June 07, 2021. May be an image ...",0,44.0,2.0,1.623115e+09,crazy-antelope-of-atheism
3,crazy-antelope-of-atheism,,https://scontent-lga3-2.cdninstagram.com/v/t51...,https://scontent-lga3-2.cdninstagram.com/v/t51...,"Photo by 🗣J on March 18, 2021. May be an image...",0,36.0,0.0,1.616098e+09,crazy-antelope-of-atheism
4,crazy-antelope-of-atheism,,https://scontent-lga3-2.cdninstagram.com/v/t51...,https://scontent-lga3-2.cdninstagram.com/v/t51...,"Photo by 🗣J on February 19, 2021. May be an im...",0,64.0,4.0,1.613768e+09,crazy-antelope-of-atheism
...,...,...,...,...,...,...,...,...,...,...
363391,smoky-nippy-malkoha,4238.0,https://scontent-lga3-1.cdninstagram.com/v/t51...,https://scontent-lga3-1.cdninstagram.com/v/t51...,username ...,1,428.0,12.0,1.641483e+09,smoky-nippy-malkoha
363392,smoky-nippy-malkoha,12531.0,https://scontent-lga3-1.cdninstagram.com/v/t51...,https://scontent-lga3-1.cdninstagram.com/v/t51...,username ...,1,952.0,11.0,1.641318e+09,smoky-nippy-malkoha
363393,smoky-nippy-malkoha,69105.0,https://scontent-lga3-1.cdninstagram.com/v/t51...,https://scontent-lga3-1.cdninstagram.com/v/t51...,username ...,1,3584.0,27.0,1.641149e+09,smoky-nippy-malkoha
363394,smoky-nippy-malkoha,96252.0,https://scontent-lga3-1.cdninstagram.com/v/t51...,https://scontent-lga3-1.cdninstagram.com/v/t51...,username ...,1,5958.0,29.0,1.640449e+09,smoky-nippy-malkoha
