# Google Vision Dataset

For this project, we aim to build a regression model to better understand the characteristics that contribute to a successful mural. To do this, we had to extract images and labels from Instagram posts. This notebook walks through the process of creating the dataset for this model.

## Uploading Dataframe

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# Loading the two Instagram datasets from GitHub

Ig_1 = pd.read_csv('https://github.com/mariasohail2/Social_Media_Analytics/raw/main/instagram1.csv')
Ig_2 = pd.read_csv('https://github.com/mariasohail2/Social_Media_Analytics/raw/main/instagram2.csv')

To perform our analysis and create the dashboard, we used two datasets retrieved with Instaloader. For the first dataset, our focus was scraping comments from posts under popular mural hashtags such as #urbanart, #mural, and #streetart. The second dataset contains posts and their corresponding captions from two prominent urban art accounts.

In [None]:
Ig_1.head(5)

Unnamed: 0.1,Unnamed: 0,Caption,Comments,Likes,URL
0,0,“Always Stay Hungry” - work by @sasha.korban f...,3.0,175.0,https://scontent-atl3-1.cdninstagram.com/v/t51...
1,1,Work by @ronenglish on the famed Houston Bower...,2.0,115.0,https://scontent-atl3-1.cdninstagram.com/v/t51...
2,2,New work by @invaderwashere in Slovenia. 👾,2.0,111.0,https://scontent-atl3-1.cdninstagram.com/v/t51...
3,3,"Work by @davidzinn in Ann Arbor, Michigan.",30.0,3289.0,https://scontent-atl3-1.cdninstagram.com/v/t51...
4,4,"“Touché, coulé” - new work by @matth.velvet in...",2.0,189.0,https://scontent-atl3-1.cdninstagram.com/v/t51...


In [None]:
Ig_2.head(5)

Unnamed: 0.1,Unnamed: 0,Caption,Comments,Likes,URL
0,0,"@timo_levin wall in Kamianske, Ukraine 🇺🇦(2021...",11.0,2276.0,https://scontent-bos3-1.cdninstagram.com/v/t51...
1,1,"@jr wall in Paris, France 🇫🇷 (2021)\n•\n#jr #u...",46.0,6517.0,https://scontent-bos3-1.cdninstagram.com/v/t51...
2,2,"@3ttman wall in Lodz, Poland 🇵🇱(2013)\n•\n#3tt...",37.0,5589.0,https://scontent-bos3-1.cdninstagram.com/v/t51...
3,3,"@jessieandkatey wall in Philadelphia, USA 🇺🇸 (...",31.0,5188.0,https://scontent-bos3-1.cdninstagram.com/v/t51...
4,4,"@romanlinacero wall in Nava de la Asunción, Se...",24.0,3584.0,https://scontent-bos3-1.cdninstagram.com/v/t51...


In [None]:
# Merging the datasets into one
frames = [Ig_1, Ig_2]
Df = pd.concat(frames)
Df.head(5)

Unnamed: 0.1,Unnamed: 0,Caption,Comments,Likes,URL
0,0,“Always Stay Hungry” - work by @sasha.korban f...,3.0,175.0,https://scontent-atl3-1.cdninstagram.com/v/t51...
1,1,Work by @ronenglish on the famed Houston Bower...,2.0,115.0,https://scontent-atl3-1.cdninstagram.com/v/t51...
2,2,New work by @invaderwashere in Slovenia. 👾,2.0,111.0,https://scontent-atl3-1.cdninstagram.com/v/t51...
3,3,"Work by @davidzinn in Ann Arbor, Michigan.",30.0,3289.0,https://scontent-atl3-1.cdninstagram.com/v/t51...
4,4,"“Touché, coulé” - new work by @matth.velvet in...",2.0,189.0,https://scontent-atl3-1.cdninstagram.com/v/t51...


For the purpose of this project, we merged these datasets to obtain a larger dataset that better encompasses street art.
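
One detail worth noting about the merge: `pd.concat` keeps each frame's original row labels by default, so the combined index contains duplicates, which is one reason to address rows by position with `iloc`. A minimal sketch on toy data, assuming a fresh 0..n-1 index is wanted, passes `ignore_index=True`. The frames `a` and `b` are made-up stand-ins for `Ig_1` and `Ig_2`.

```python
import pandas as pd

# Toy stand-ins for Ig_1 and Ig_2 (hypothetical values)
a = pd.DataFrame({"Likes": [175.0, 115.0]})
b = pd.DataFrame({"Likes": [2276.0, 6517.0, 5589.0]})

kept = pd.concat([a, b])                      # row labels 0, 1, 0, 1, 2
fresh = pd.concat([a, b], ignore_index=True)  # row labels 0, 1, 2, 3, 4

print(kept.index.is_unique)   # False
print(fresh.index.tolist())   # [0, 1, 2, 3, 4]
```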

## Object Detection

In this section, we extract the objects identified by Google Vision, and their associated scores, for each Instagram post.

In [None]:
import os
import pandas as pd
from google.cloud import vision

# Point the client library at the service-account key file
Application_Credentials = '/Users/tahreembutt/Desktop/My Project 74932-95eea7d0bd8b.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = Application_Credentials
client = vision.ImageAnnotatorClient()
image = vision.Image()

In [None]:
Df["Object_Num"] = np.nan

Creating a new column to store the total number of objects identified per image.

In [None]:
Df

Unnamed: 0.1,Unnamed: 0,Caption,Comments,Likes,URL,Object_Num
0,0,“Always Stay Hungry” - work by @sasha.korban f...,3.0,175.0,https://scontent-atl3-1.cdninstagram.com/v/t51...,
1,1,Work by @ronenglish on the famed Houston Bower...,2.0,115.0,https://scontent-atl3-1.cdninstagram.com/v/t51...,
2,2,New work by @invaderwashere in Slovenia. 👾,2.0,111.0,https://scontent-atl3-1.cdninstagram.com/v/t51...,
3,3,"Work by @davidzinn in Ann Arbor, Michigan.",30.0,3289.0,https://scontent-atl3-1.cdninstagram.com/v/t51...,
4,4,"“Touché, coulé” - new work by @matth.velvet in...",2.0,189.0,https://scontent-atl3-1.cdninstagram.com/v/t51...,
...,...,...,...,...,...,...
841,841,"@findac wall in Paris, France 🇫🇷(2018)\n•\n#fi...",23.0,4403.0,https://scontent-bos3-1.cdninstagram.com/v/t51...,
842,842,"@artofdavidwalker wall in Nancy, France 🇫🇷(201...",62.0,9424.0,https://scontent-bos3-1.cdninstagram.com/v/t51...,
843,843,"@smeetsbart wall in Brussels, Belgium 🇧🇪(2018)...",18.0,3360.0,https://scontent-bos3-1.cdninstagram.com/v/t51...,
844,844,"@fintan_magee wall in Munich, Germany 🇩🇪(2018)...",21.0,2463.0,https://scontent-bos3-1.cdninstagram.com/v/t51...,


In [None]:
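# Counting the objects Google Vision localizes in each image
client = vision.ImageAnnotatorClient()  # one client is enough for every request
url_col = Df.columns.get_loc("URL")
num_col = Df.columns.get_loc("Object_Num")
for i in range(len(Df)):
    image = vision.Image()
    image.source.image_uri = Df.iloc[i, url_col]
    objects = client.object_localization(image=image).localized_object_annotations
    Df.iloc[i, num_col] = len(objects)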

In [None]:
Df["Object_Num"] = Df["Object_Num"].astype(int)

In [None]:
# Creating columns to store each object's score and name
# (initialized with None so the columns can hold strings)
for i in range(int(max(Df["Object_Num"]))):
    Df["Object_"+str(i)+"_score"] = None

for i in range(int(max(Df["Object_Num"]))):
    Df["Object_"+str(i)] = None

# Temporary storage for each row's object names and scores
Df["Total_Objects_Names"] = None
Df["Total_Objects_Scores"] = None

In [None]:
client = vision.ImageAnnotatorClient()
url_col = Df.columns.get_loc("URL")
names_col = Df.columns.get_loc("Total_Objects_Names")
scores_col = Df.columns.get_loc("Total_Objects_Scores")
for i in range(len(Df)):
    image = vision.Image()
    image.source.image_uri = Df.iloc[i, url_col]  # Link
    objects = client.object_localization(image=image).localized_object_annotations
    Object_List = []
    Object_Score = []
    for object_ in objects:
        Object_List.append(object_.name)
        Object_Score.append(str(object_.score))
    # Store each list as a string in a single cell. Assigning through
    # .iloc[row, col] avoids the SettingWithCopyWarning that chained
    # indexing (Df[col].iloc[i] = ...) triggers.
    Df.iloc[i, names_col] = str(Object_List)
    Df.iloc[i, scores_col] = str(Object_Score)


Getting all object names and scores for each image and storing each list in a single column.
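
Because the count, the names, and the scores all come from the same `localized_object_annotations` list, one response per image is enough to fill all three. The sketch below illustrates that idea with a hypothetical helper, `summarize_objects`, which is not part of this notebook. It only assumes that each annotation exposes `.name` and `.score`, and it uses `SimpleNamespace` stand-ins with made-up values rather than a real Vision response.

```python
from types import SimpleNamespace

def summarize_objects(annotations):
    """Return (count, names, scores) for one image's object annotations."""
    names = [a.name for a in annotations]
    scores = [float(a.score) for a in annotations]
    return len(names), names, scores

# Stand-ins for Vision annotations (made-up values)
fake = [SimpleNamespace(name="Building", score=0.78),
        SimpleNamespace(name="Car", score=0.53)]

count, names, scores = summarize_objects(fake)
print(count, names, scores)
```

With a helper like this, the counting loop and the name/score loop could share a single request per image instead of calling the API twice.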

In [None]:
# Inputting scores in the correct columns
for i in range(len(Df)):
    l = Df.iloc[i, Df.columns.get_loc("Total_Objects_Scores")]  # List with scores
    l = l.strip("]")
    l = l.strip("[")
    split_list = l.split(",")  # Storing scores
    Obj_Num = Df.iloc[i, Df.columns.get_loc("Object_Num")]
    for g in range(Obj_Num):
        Df.iloc[i, Df.columns.get_loc("Object_"+str(g)+"_score")] = split_list[g]

Inputting scores in individual columns.

In [None]:
# Inputting names in the correct columns
for i in range(len(Df)):
    l = Df.iloc[i, Df.columns.get_loc("Total_Objects_Names")]  # List with names
    l = l.strip("]")
    l = l.strip("[")
    split_list = l.split(",")  # Storing names
    Obj_Num = Df.iloc[i, Df.columns.get_loc("Object_Num")]
    for g in range(Obj_Num):
        Df.iloc[i, Df.columns.get_loc("Object_"+str(g))] = split_list[g]

Inputting names in individual columns.
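
Splitting on commas works as long as no name itself contains a comma, and it leaves quotes and leading spaces to be cleaned up later. As an alternative sketch (not the approach used in this notebook), Python's standard `ast.literal_eval` can turn a stringified list back into a real list. The helper name `parse_list_cell` is hypothetical.

```python
import ast

def parse_list_cell(cell):
    """Parse a cell like "['Building', 'Car']" back into a Python list."""
    return list(ast.literal_eval(cell))

names = parse_list_cell("['Building', 'Car', 'Street fashion']")
scores = [float(s) for s in parse_list_cell("['0.78', '0.53']")]

print(names)   # ['Building', 'Car', 'Street fashion']
print(scores)  # [0.78, 0.53]
```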

## Label Detection

In this section, we extract the labels identified by Google Vision, and their associated scores, for each Instagram post.

In [None]:
Df["Label_Num"] = 10

By default, Google Vision returns up to 10 labels per image, so we reserve 10 label columns.
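
If an image comes back with fewer labels than the fixed number of columns, indexing the list position by position would raise an `IndexError`. A small sketch of one way to guard against that, assuming missing slots should simply stay empty (`None`); `pad_to_width` is a hypothetical helper, not something defined in this notebook.

```python
def pad_to_width(values, width=10, fill=None):
    """Truncate or pad a list so it has exactly `width` entries."""
    values = list(values)[:width]
    return values + [fill] * (width - len(values))

labels = ["Building", "Art", "Mural"]   # made-up short result
padded = pad_to_width(labels)

print(len(padded))   # 10
print(padded[:4])    # ['Building', 'Art', 'Mural', None]
```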

In [None]:
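# Creating columns to store each label's score and name
# (initialized with None so the columns can hold strings)
for i in range(int(max(Df["Label_Num"]))):
    Df["Label_"+str(i)+"_score"] = None

for i in range(int(max(Df["Label_Num"]))):
    Df["Label_"+str(i)] = None

# Temporary storage for each row's label names and scores
Df["Labels_Objects_Names"] = None
Df["Labels_Objects_Scores"] = None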

In [None]:
client = vision.ImageAnnotatorClient()
url_col = Df.columns.get_loc("URL")
names_col = Df.columns.get_loc("Labels_Objects_Names")
scores_col = Df.columns.get_loc("Labels_Objects_Scores")
for i in range(len(Df)):
    image = vision.Image()
    image.source.image_uri = Df.iloc[i, url_col]
    response = client.label_detection(image=image)
    labels = response.label_annotations
    Labels_List = []
    Labels_Score = []
    for label in labels:
        Labels_List.append(label.description)
        Labels_Score.append(str(label.score))
    # Store each list as a string in a single cell. Assigning through
    # .iloc[row, col] avoids the SettingWithCopyWarning that chained
    # indexing (Df[col].iloc[i] = ...) triggers.
    Df.iloc[i, names_col] = str(Labels_List)
    Df.iloc[i, scores_col] = str(Labels_Score)


Getting all label names and scores for each image and storing each list in a single column.

In [None]:
# Inputting scores in the correct columns
for i in range(len(Df)):
    l = Df.iloc[i, Df.columns.get_loc("Labels_Objects_Scores")]  # List with scores
    l = l.strip("]")
    l = l.strip("[")
    split_list = l.split(",")  # Storing scores
    # Guard against images that returned fewer than 10 labels
    Label_Num = min(10, len(split_list))
    for g in range(Label_Num):
        Df.iloc[i, Df.columns.get_loc("Label_"+str(g)+"_score")] = split_list[g]


Inputting scores in individual columns.

In [None]:
# Inputting names in the correct columns
for i in range(len(Df)):
    l = Df.iloc[i, Df.columns.get_loc("Labels_Objects_Names")]  # List with names
    l = l.strip("]")
    l = l.strip("[")
    split_list = l.split(",")  # Storing names
    # Guard against images that returned fewer than 10 labels
    Label_Num = min(10, len(split_list))
    for g in range(Label_Num):
        Df.iloc[i, Df.columns.get_loc("Label_"+str(g))] = split_list[g]

Inputting labels in individual columns.

In [None]:
Df

Unnamed: 0.1,Unnamed: 0,Caption,Comments,Likes,URL,Object_Num,Object_0_score,Object_1_score,Object_2_score,Object_3_score,...,Label_2,Label_3,Label_4,Label_5,Label_6,Label_7,Label_8,Label_9,Labels_Objects_Names,Labels_Objects_Scores
0,0,“Always Stay Hungry” - work by @sasha.korban f...,3,175,https://scontent-atl3-1.cdninstagram.com/v/t51...,2,0.781665,0.530012,,,...,White,Window,Infrastructure,Street fashion,Automotive lighting,Eyewear,Art,Neighbourhood,"['Building', 'Car', 'White', 'Window', 'Infras...","['0.964320719242096', '0.933414101600647', '0...."
1,1,Work by @ronenglish on the famed Houston Bower...,2,115,https://scontent-atl3-1.cdninstagram.com/v/t51...,1,0.710405,,,,...,Graffiti,Brick,Building,Facade,Mural,Rectangle,City,Illustration,"['Painting', 'Art', 'Graffiti', 'Brick', 'Buil...","['0.8222779035568237', '0.8212408423423767', '..."
2,2,New work by @invaderwashere in Slovenia. 👾,2,111,https://scontent-atl3-1.cdninstagram.com/v/t51...,1,0.929607,,,,...,Font,Line,Gas,Symmetry,Pattern,Creative arts,Illustration,Wood,"['Rectangle', 'Art', 'Font', 'Line', 'Gas', 'S...","['0.8679025769233704', '0.8473606109619141', '..."
3,3,"Work by @davidzinn in Ann Arbor, Michigan.",30,3289,https://scontent-atl3-1.cdninstagram.com/v/t51...,0,,,,,...,Brickwork,Brick,Building material,Wall,Line,Composite material,Pattern,Wood,"['Building', 'Rectangle', 'Brickwork', 'Brick'...","['0.8826923370361328', '0.8644804954528809', '..."
4,4,"“Touché, coulé” - new work by @matth.velvet in...",2,189,https://scontent-atl3-1.cdninstagram.com/v/t51...,4,0.695513,0.635542,0.555453,0.550535,...,Window,Building,Paint,Azure,Art paint,Infrastructure,Urban design,Art,"['Sky', 'Daytime', 'Window', 'Building', 'Pain...","['0.9444730877876282', '0.9441952705383301', '..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1082,841,"@findac wall in Paris, France 🇫🇷(2018)\n•\n#fi...",23,4403,https://scontent-bos3-1.cdninstagram.com/v/t51...,4,0.730663,0.568791,0.561120,0.525850,...,Window,World,Sculpture,Statue,Art,Facade,Monument,City,"['Sky', 'Building', 'Window', 'World', 'Sculpt...","['0.958717942237854', '0.9427456855773926', '0..."
1083,842,"@artofdavidwalker wall in Nancy, France 🇫🇷(201...",62,9424,https://scontent-bos3-1.cdninstagram.com/v/t51...,10,0.866586,0.861993,0.858894,0.820464,...,Wheel,Daytime,Bicycle wheel,Building,Infrastructure,Human,Paint,Painting,"['Bicycle', 'Tire', 'Wheel', 'Daytime', 'Bicyc...","['0.973189115524292', '0.9658660292625427', '0..."
1084,843,"@smeetsbart wall in Brussels, Belgium 🇧🇪(2018)...",18,3360,https://scontent-bos3-1.cdninstagram.com/v/t51...,5,0.888641,0.665505,0.660465,0.652895,...,Grey,Sneakers,Flooring,Rolling,Floor,Wall,Recreation,Walking shoe,"['Shoe', 'Art', 'Grey', 'Sneakers', 'Flooring'...","['0.9540275931358337', '0.8576368093490601', '..."
1085,844,"@fintan_magee wall in Munich, Germany 🇩🇪(2018)...",21,2463,https://scontent-bos3-1.cdninstagram.com/v/t51...,5,0.909237,0.842865,0.769731,0.751040,...,Window,Building,Sky,Flash photography,Waist,Standing,Shorts,Thigh,"['Arm', 'Shoulder', 'Window', 'Building', 'Sky...","['0.943327784538269', '0.9427759647369385', '0..."


The resulting dataframe, with labels and objects in individual columns alongside their associated scores.
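
The same wide layout can also be produced in one vectorized step rather than row by row. The sketch below uses toy data with a hypothetical `names` column that holds real Python lists (not the stringified lists stored in this notebook) and builds one column per list position with `pd.DataFrame(...tolist())`. Shorter lists are padded with missing values automatically.

```python
import pandas as pd

# Toy data: each cell holds a list of detected names (made-up values)
df = pd.DataFrame({"names": [["Building", "Car"], ["Painting"], []]})

# One column per list position; shorter rows are filled with missing values
wide = pd.DataFrame(df["names"].tolist(), index=df.index)
wide.columns = ["Object_" + str(i) for i in wide.columns]

out = df.join(wide)
print(out.columns.tolist())  # ['names', 'Object_0', 'Object_1']
```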

## Data Cleaning

In this section, we clean the data so the content is easily parsable by the model.

In [None]:
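# Removing leftover quotes, commas, and surrounding whitespace from object names
# (regex=True is required for the character class to be treated as a pattern)
for g in range(0,10):
    Df["Object_"+str(g)] = Df["Object_"+str(g)].str.replace(r"[\"\',]", '', regex=True).str.strip()

# Removing the same characters from object scores, then converting them to numbers
for g in range(0,10):
    Df["Object_"+str(g)+"_score"] = Df["Object_"+str(g)+"_score"].str.replace(r"[\"\',]", '', regex=True)
    Df["Object_"+str(g)+"_score"] = pd.to_numeric(Df["Object_"+str(g)+"_score"])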

In [None]:
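# Removing leftover quotes and commas from label scores, then converting them to numbers
for g in range(0,10):
    Df["Label_"+str(g)+"_score"] = Df["Label_"+str(g)+"_score"].str.replace(r"[\"\',]", '', regex=True)
    Df["Label_"+str(g)+"_score"] = pd.to_numeric(Df["Label_"+str(g)+"_score"])

# Removing the same characters, plus surrounding whitespace, from label names
for g in range(0,10):
    Df["Label_"+str(g)] = Df["Label_"+str(g)].str.replace(r"[\"\',]", '', regex=True).str.strip()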
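
A note on `str.replace`: since pandas 2.0 the pattern is treated as a literal string unless `regex=True` is passed, so a character-class pattern like the one used in these cleaning cells needs that flag to have any effect. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up raw cells, as they look after the bracket strip and comma split
raw = pd.Series([" '0.9643'", "'0.9334'", None])

# regex=True makes the character class match quotes and commas
cleaned = raw.str.replace(r"[\"\',]", "", regex=True).str.strip()
scores = pd.to_numeric(cleaned)

print(scores.tolist())  # [0.9643, 0.9334, nan]
```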

Now that this dataset is populated and cleaned, we will use it to build a random forest regressor model.