# Applied Data Analysis project

## Instagram Sentiment Classification

This notebook walks you through the whole Instagram Sentiment Classification part of the project so that you can easily reproduce the results yourself.

### Goal formulation
In this part of the project our goal was to detect sentiment in Instagram images, and in images only, i.e., we disregarded all textual descriptions of the images. This reduces our task to an image classification problem.

The name "Sentiment analysis" might be a bit misleading here. We do not want the sentiment per se, as in a happy versus a sad face occurring in the image. Instead, we consider an image to have a positive sentiment if it depicts a place that is pleasant to walk through or by.

The original plan was to do something in the style of [goodcitylife.org](http://www.goodcitylife.org/), i.e. classifying the sentiment of particular streets in some Swiss city. However, our data does not contain location information with such a level of granularity. Therefore, we have redefined our goal to be sentiment (or pleasantness) analysis of cities in Switzerland instead of streets in one specific Swiss city.

### General approach definition

At first, we planned to train a new convolutional network using data from [image-net.org](http://image-net.org/). Image-net.org contains an extensive collection of images divided into categories by the object depicted in them. We wanted to download several categories, positive (e.g. mountains, lakeside), neutral (faces, dogs, food) and negative (graffiti, trash), and train the network on them. However, we realized that training would have taken too much time.

After some research, we found out about [Inception v3](https://arxiv.org/abs/1512.00567), a convolutional neural network model pre-trained on 1000 image categories of ImageNet. We tested it on a sample of our data and, as it performed quite well, we decided to include it in our pipeline.

![Inception v3 Architecture](inception_v3.png)
<span style="width:100%;text-align:center;float:right;"><i>Schematic diagram of Inception-v3 (source: https://codelabs.developers.google.com/codelabs/cpb102-txf-learning/)</i></span><br>


The Inception v3 model returns the labels of the N most probable objects depicted in the image together with the confidence of each classification. We decided to leverage this and use these labels and confidences to calculate the sentiment. An example image from our dataset and the Inception v3 output for that image follow:  
<img src="https://scontent.cdninstagram.com/t51.2885-15/e35/14063668_297414773969254_2078795989_n.jpg?ig_cache_key=MTMyOTc0Mzk5ODMxOTA1MjE3MA%3D%3D.2" alt="Sample Instagram Image" style="width: 300px; margin-right: 10px;" align="left"/>
<br>
{  
&nbsp;&nbsp;&nbsp;&nbsp;"alp": 0.016730202361941338,  
&nbsp;&nbsp;&nbsp;&nbsp;"dam, dike, dyke": 0.2930382788181305,  
&nbsp;&nbsp;&nbsp;&nbsp;"lakeside, lakeshore": 0.29257863759994507,  
&nbsp;&nbsp;&nbsp;&nbsp;"promontory, headland, head, foreland": 0.07077361643314362,  
&nbsp;&nbsp;&nbsp;&nbsp;"valley, vale": 0.10409216582775116  
}  


### Technical discussion
Even though we have some experience with the [TensorFlow](https://www.tensorflow.org/) library, our original plan was to use the [Keras](https://keras.io/) library for the convolutional neural network, just for the sake of learning something new. However, as we were quite constrained on time, in the end we decided to use only TensorFlow to implement the neural network.

We also had problems running TensorFlow on the [Spark](http://spark.apache.org/) cluster provided to us by [EPFL](http://epfl.ch/). Thus, we resorted to using [Amazon AWS EC2](https://aws.amazon.com/ec2/) servers.

We used [Python 3.5](https://www.python.org/downloads/release/python-353/) and [bash](https://tiswww.case.edu/php/chet/bash/bashtop.html) throughout.

### Pipeline
I will start the explanation of the pipeline from Felix's output in the [/Instagram-Download/](https://github.com/korcek-juraj/epfl-ada16-project/tree/master/Instagram-Download) folder. I received files from Felix in which every line contained an Instagram Image ID and its url, separated by a space.

#### Pipeline - 1. Data preprocessing
I wanted to load the file into Python. Additionally, in our dataset a single Instagram Image ID can have several urls. Therefore, I had to preprocess the file. This is done by the [create_url_dict.py](https://github.com/korcek-juraj/epfl-ada16-project/blob/master/Instagram-Classification/create_url_dict.py) script. It produces a Python dictionary with every Instagram Image ID as a key and a list of the urls associated with it as the value. The result is pickled into a file with the same name, just with the .pickle extension.

In [None]:
# %load ./create_url_dict.py
import os
import argparse
import pickle


def create_url_dict(filename, append_to_dict=None):
    """
    Processes the file passed in filename argument.
    The file is expected to contain key-value pairs (split by space) of Instagram image ID and its url at every line.

    The result is a dictionary with every Instagram Image ID as key and list of urls associated to it as value.

    The parsed results can be returned as a new dictionary or be appended to an existing one passed in argument 'append_to_dict'
    :param filename: a file with Instagram IDs and urls to process
    :param append_to_dict: a dictionary to append the results to; if None, new dictionary is returned
    :return: new dictionary or modified dictionary passed in 'append_to_dict' parameter
    """
    if append_to_dict is None:
        ret = {}
    else:
        ret = append_to_dict

    with open(filename) as urlfile:
        for line in urlfile:
            elem_list = line.split()
            if elem_list[0] in ret:
                ret[elem_list[0]].append(elem_list[1])
            else:
                ret[elem_list[0]] = [elem_list[1], ]
    return ret


def main(filename):
    """
    Parses file passed in filename using create_url_dict method and pickles resulting dictionary for further use.

    Resulting pickle file is saved into the same directory as the input file with the same filename,
    but with extension '.pickle'.
    :param filename: filename of file to be parsed
    :return: None
    """
    # Use the function argument instead of the global 'args' so that main() also works when imported.
    base, _ = os.path.splitext(filename)
    urldict = create_url_dict(filename)
    with open(base + '.pickle', 'wb') as handle:
        pickle.dump(urldict, handle, protocol=pickle.HIGHEST_PROTOCOL)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--filename', type=str, required=True)
    args = parser.parse_args()

    main(args.filename)


We will run the script on the urls from September to October. Note that if you want to run the script from the command line instead of the notebook, you need to run `python3 create_url_dict.py -f months/september-october_urls.txt` (assuming you are in the directory where the script is located).

In [6]:
%run ./create_url_dict.py -f months/september-october_urls.txt

If you check the /Instagram-Classification/months folder you will see that september-october_urls.pickle got created/updated. We will use it in the next step.
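
To sanity-check the pickle before moving on, it can be loaded back and inspected. The sketch below is self-contained and only illustrative: the IDs and urls are made up, and the helper name `group_urls` is an assumption that mirrors what `create_url_dict` does rather than reusing it.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def group_urls(lines):
    """Group 'image_id url' lines into {image_id: [url, ...]}."""
    grouped = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        grouped[parts[0]].append(parts[1])
    return dict(grouped)


# Made-up sample data: one ID appears twice with different urls.
sample_lines = [
    "id_1 http://example.com/a.jpg\n",
    "id_2 http://example.com/b.jpg\n",
    "id_1 http://example.com/a_alt.jpg\n",
]
url_dict = group_urls(sample_lines)

# Round-trip through pickle, the way the next pipeline step reads the file.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "sample_urls.pickle")
    with open(pickle_path, "wb") as handle:
        pickle.dump(url_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(pickle_path, "rb") as handle:
        loaded = pickle.load(handle)

print(len(loaded), "image IDs;", loaded["id_1"])
```

To inspect the real file, the same `pickle.load` call can be pointed at months/september-october_urls.pickle.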

#### Pipeline - 2. Parallelized Download & Classification
In order to leverage the computational power of the Amazon AWS EC2 servers we had to parallelize our computation well. Our pipeline consists of $d$ download workers (processes) that download images from Instagram, $c$ classification workers that classify images using TensorFlow library and Inception v3 model, 1 worker to write the results into a file and 1 worker to report progress. To send jobs/data from one worker type to another we have used process-safe queues from python's [multiprocessing](https://docs.python.org/3.5/library/multiprocessing.html) library. 

This way, utilizing the computing architecture to the fullest is a matter of setting a couple of parameters correctly. It is important to get the balance between download and classification processes right. If you have too few download processes, your classification processes will starve because they will have nothing to process; you will not utilize the available computational power and you will waste time. On the other hand, if you have too many download processes, the classification processes might be too slow to keep up, and the downloaded images might accumulate on your storage until you run out of space and the script crashes.
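
The hand-off between worker types can be sketched in a few lines. This is a simplified illustration, not the actual script: it uses threads and `queue.Queue` instead of processes to stay short, the stage functions are stand-ins that do no real downloading or classification, and the `maxsize` bound on the image queue is an assumption added here to show one way of capping the number of images waiting on disk (the script described in this notebook uses unbounded queues).

```python
import queue
import threading

SENTINEL = None  # marks the end of a queue for one worker


def download_stage(url_queue, image_queue):
    """Stand-in for a download worker: turns IDs into fake filenames."""
    while True:
        img_id = url_queue.get()
        if img_id is SENTINEL:
            break
        # put() blocks while image_queue is full, which throttles the downloads.
        image_queue.put(img_id + ".jpg")


def classify_stage(image_queue, results):
    """Stand-in for a classification worker: records what it consumed."""
    while True:
        filename = image_queue.get()
        if filename is SENTINEL:
            break
        results.append(filename)  # list.append is thread-safe in CPython


def run_pipeline(img_ids, n_download=2, n_classify=3, max_pending=4):
    url_queue = queue.Queue()
    image_queue = queue.Queue(maxsize=max_pending)  # bounded: caps pending images
    results = []

    for img_id in img_ids:
        url_queue.put(img_id)
    for _ in range(n_download):
        url_queue.put(SENTINEL)  # one sentinel per download worker

    downloaders = [threading.Thread(target=download_stage, args=(url_queue, image_queue))
                   for _ in range(n_download)]
    classifiers = [threading.Thread(target=classify_stage, args=(image_queue, results))
                   for _ in range(n_classify)]
    for worker in downloaders + classifiers:
        worker.start()
    for worker in downloaders:
        worker.join()
    for _ in range(n_classify):
        image_queue.put(SENTINEL)  # one sentinel per classification worker
    for worker in classifiers:
        worker.join()
    return results


processed = run_pipeline(["img%d" % i for i in range(20)])
print(len(processed), "images processed")
```

With a bounded queue, surplus download workers simply wait instead of filling the disk, at the cost of some idle download capacity.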

The [pipeline.py](https://github.com/korcek-juraj/epfl-ada16-project/blob/master/Instagram-Classification/pipeline.py) script takes care of running this.

The images are downloaded into the /Instagram-Classification/images/ folder and are removed after they are processed. If an image fails to download, an empty file named InstagramImageID.jpg.failed is created in that directory. Therefore, it is a good idea to clean the /Instagram-Classification/images/ folder between runs.
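
Cleaning the folder between runs can be done by hand or with a small helper. The following is a minimal sketch, assuming that only `*.jpg` and `*.jpg.failed` files should be removed; the function name `clean_images_dir` is hypothetical and not part of the project scripts, and the demonstration runs on a throwaway directory with made-up file names.

```python
import os
import tempfile


def clean_images_dir(images_dir):
    """Delete leftover '*.jpg' and '*.jpg.failed' files; return how many were removed."""
    removed = 0
    for name in os.listdir(images_dir):
        path = os.path.join(images_dir, name)
        if os.path.isfile(path) and (name.endswith(".jpg") or name.endswith(".jpg.failed")):
            os.remove(path)
            removed += 1
    return removed


# Demonstration on a temporary directory instead of the real images/ folder.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("123.jpg", "456.jpg.failed", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    removed_count = clean_images_dir(demo_dir)
    remaining = sorted(os.listdir(demo_dir))

print(removed_count, remaining)
```

Note that the list of `.failed` files is the only record of images that could not be downloaded, so it may be worth saving their names before deleting them.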

The classification results are written into the file specified by the `-o` or `--output_json_file` parameter. If it is not specified, the results are written into a result%Y%m%d-%H%M%S.json file in the /Instagram-Classification/ directory. Although this file has the .json extension, it is not valid JSON as a whole; instead, each line of the file contains a valid JSON string. This was done so that the results can be written into the file one by one without the need to keep all the results in memory, which would be the case if we wanted to write a single valid JSON file. Another advantage is that results from different runs can easily be merged into one file with a simple shell command: `cat results1.json results2.json results3.json > results_all.json`
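
Because every line is a separate JSON object, the results file has to be read line by line rather than with a single `json.load` call. A minimal sketch of such a reader follows; the function name `iter_results` and the sample records are assumptions made for illustration.

```python
import json
import os
import tempfile


def iter_results(path):
    """Yield (image_id, predictions) from a file with one JSON object per line."""
    with open(path) as results_file:
        for line in results_file:
            line = line.strip()
            if not line:
                continue  # tolerate empty lines, e.g. after concatenating files
            record = json.loads(line)
            for img_id, predictions in record.items():
                yield img_id, predictions


# Made-up sample written in the same one-object-per-line layout.
sample = [
    {"id_1": {"alp": 0.7, "valley, vale": 0.1}},
    {"id_2": {"plate": 0.9}},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results_sample.json")
    with open(path, "w") as handle:
        for record in sample:
            handle.write(json.dumps(record) + "\n")
    parsed = list(iter_results(path))

print(parsed)
```

Reading lazily like this keeps memory use constant regardless of how many images were classified.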

If you do not want to download and analyse all the Instagram Image IDs in the .pickle file (e.g. when testing/debugging), you can use the optional script parameter `-t` or `--test_run` to specify the number of Image IDs to process. Not setting this parameter, or setting it to 0, results in processing all the Image IDs in the .pickle file.

You can also specify the progress reporting interval by setting `-ri` or `--progress_report_interval` to the number of seconds you desire. The default is 5 seconds. If you set it to 0, no reporting will take place.

The script writes logs to stdout. Probably the best way to run the script is to run it in the background using `&` and to redirect the output into, e.g., log.txt.

In [None]:
# %load ./pipeline.py
import pickle
import multiprocessing
from multiprocessing import Process, Queue, Value
from queue import Empty
from urllib import request, error
import os.path
from ctypes import c_bool
import json
import time
import argparse
import errno

import tensorflow as tf
import numpy as np
from classify_image import maybe_download_and_extract, create_graph, NodeLookup, FLAGS


def download_worker(url_queue, image_queue):
    """
    A worker for download processes.

    It executes following steps repeatedly:
    1) It gets Instagram Image ID and corresponding list of urls from "process-safe" queue passed in url_queue parameter.

    2) It attempts to download the image from urls available in the list into images/ folder as InstagramID.jpg file. If
    the image is successfully downloaded using an url, the rest of the urls for given Instagram Image ID in the list is
    ignored. If none of the urls worked, an empty file with name InstagramID.jpg.failed is created in the images/ directory
    for logging purposes.

    3) Once the image is downloaded its filename is put into the "process-safe" queue passed in image_queue parameter.
     From this queue classification workers obtain ready-to-be-classified images.

    If an exception is thrown, it is logged and the worker continues to process next Instagram Image ID.

    The process / worker exits once it gets (None, None) tuple from the url_queue.
    :param url_queue: "process-safe" queue containing tuples (Instagram Image ID, list of image urls corresponding to given ID)
    :param image_queue: "process-safe" queue containing image filenames to be processed by classification workers
    :return: None
    """
    while True:
        try:
            img_id, url_list = url_queue.get(timeout=1)
            if img_id is None:
                break
            filename = 'images/' + img_id + '.jpg'
            if os.path.isfile(filename):
                image_queue.put(filename)
                continue
            else:
                for url in url_list:
                    try:
                        request.urlretrieve(url, filename)
                        break
                    except (error.HTTPError, error.URLError):
                        continue
                if not os.path.isfile(filename):
                    open(filename + '.failed','a+').close()
                    print(img_id + ': image missing!', flush=True)
                else:
                    image_queue.put(filename)
        except Empty:
            continue
        except Exception as e:
            print('Unexpected exception in download_worker: ' + str(e), flush=True)


def classification_worker(image_queue, result_queue):
    """
    A worker for classification processes.

    First, it loads TensorFlow Inception v3 CNN model using create_graph() function and initializes TensorFlow session.

    Then, it executes following steps repeatedly:
    1) It gets image filename from "process-safe" queue passed in image_queue parameter.

    2) It runs predictions on the image using predict() function.

    3) It deletes the image.

    4) The result is pushed onto "process-safe" queue passed in result_queue parameter in form of dictionary {img_id: result}.
     This queue feeds the process / worker responsible for saving the results into a file.

    If an exception is thrown, it is logged and the worker continues to process next image file.

    The process / worker exits once it gets None from the image_queue.
    :param image_queue: "process-safe" queue containing filenames of images to be classified
    :param result_queue: "process-safe" queue containing classification results to be processed by results-saving worker
    :return: None
    """
    create_graph()

    with tf.Session() as sess:
        # 'softmax:0': A tensor containing the normalized prediction across
        #   1000 labels.
        softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')

        while True:
            try:
                image = image_queue.get(timeout=1)
                if image is None:
                    break
                img_id, result = predict(image, sess, softmax_tensor)
                result_queue.put({img_id: result})
                os.remove(image)
            except Empty:
                continue
            except Exception as e:
                print('Unexpected exception in classification_worker: ' + str(e), flush=True)


def predict(image, sess, softmax_tensor):
    """
    Function used by classification workers to get prediction on image.

    This method was adapted based on run_inference_on_image() method from classify_image.py found in TensorFlow official tutorial.

    :param image: filename of the image to be classified
    :param sess: TensorFlow session
    :param softmax_tensor: tensor used for computing the predictions
    :return: (img_id, result) with img_id being Instagram Image ID and the result being dictionary with 5 most probable
    objects depicted in the image as keys and corresponding prediction confidences as values.
    """
    img_id = os.path.splitext(os.path.basename(image))[0]
    image_data = tf.gfile.FastGFile(image, 'rb').read()

    # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG
    #   encoding of the image.
    predictions = sess.run(softmax_tensor, {'DecodeJpeg/contents:0': image_data})
    predictions = np.squeeze(predictions)

    # Creates node ID --> English string lookup.
    node_lookup = NodeLookup()

    top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
    result = {}
    for node_id in top_k:
        human_string = node_lookup.id_to_string(node_id)
        score = predictions[node_id]
        result[human_string] = float(score)

    return img_id, result


def save_result_worker(result_queue, output_file):
    """
    A worker for process saving the classification results into a file.

    First, it opens/creates the outputfile specified in output_file parameter.

    Then, it executes following steps repeatedly:
    1) It gets image result dictionary from "process-safe" queue passed in result_queue parameter.

    2) It converts the result into JSON format and appends it to the output file.

    The process / worker exits once it gets None from the result_queue.
    :param result_queue: "process-safe" queue containing results from image classification in form of one dictionary per image
    :param output_file: filename of the results file to be written to
    :return: None
    """
    dir_path = os.path.dirname(output_file)
    # os.makedirs('') raises an error, so only create the directory when the
    # output path actually contains one (the default output file does not).
    if dir_path:
        os.makedirs(dir_path, exist_ok=True)
    with open(output_file, "a+") as result_file:
        while True:
            try:
                kv = result_queue.get(timeout=1)
                if kv is None:
                    break
                result_file.write(json.dumps(kv) + '\n')
            except Empty:
                continue


def progress_reporting_worker(url_queue, url_queue_orig_len, image_queue, result_queue, progress_report_interval):
    """
    A simple worker for progress reporting and debugging purposes.

    It writes into the stdout number of images remaining to be processed.

    It also writes size of the two queues. This is useful to decide whether there is balance between number of download
    and classification workers.

    The process / worker is expected to be run in daemon mode and thus to exit once its parent process exits. Thus the
    infinite loop is not a problem.
    :param url_queue: "process-safe" queue containing tuples (Instagram Image ID, list of image urls corresponding to given ID)
    :param url_queue_orig_len: original length of url_queue
    :param image_queue: "process-safe" queue containing filenames of images to be classified
    :param result_queue: "process-safe" queue containing results from image classification in form of one dictionary per image
    :param progress_report_interval: interval how often to write the current progress into stdout in seconds
    :return: None
    """
    while True:
        print('Images remaining (length of url_queue): ' + str(url_queue.qsize()) + '/' + str(url_queue_orig_len), flush=True)
        print('image_queue length: ' + str(image_queue.qsize()), flush=True)
        print('result_queue length: ' + str(result_queue.qsize()), flush=True)
        time.sleep(progress_report_interval)


def main(path_to_imgs_pickle, test_run, download_process_no, classification_process_no, output_file, progress_report_interval):
    """
    Sets up and runs the multiprocessing pipeline.

    It consists of following steps:
    1) Downloading Inception v3 model if it is not present. This is done by maybe_download_and_extract() function.

    2) Filling imgid_urls_queue "process-safe" queue with Instagram Image IDs and its corresponding url lists from pickle
     file passed in path_to_imgs_pickle parameter. The amount of key value pairs (Image ID, url list) loaded is limited
     by the parameter test_run. Setting it to 0 or None loads all the available records from the file.

    3) Initialization of  "process-safe" queues img_filename_queue and result_queue.

    4) Spawning number of download process which is specified in download_process_no parameter.
    It also download_process_no-times appends tuples (None, None) to the end of imgid_urls_queue to signal the end of
    the queue for the download workers.

    5) Spawning number of classification process which is specified in classification_process_no parameter.

    6) Spawning worker for saving results into a file.

    7) Spawning progress-reporting worker.

    8) Joining the download workers. Once the download is done None is appended to the end of img_filename_queue
    classification_process_no-times to signal classification workers the end of queue.

    9) Joining classification workers. Once that is done, None is appended to result_queue to signal results-saving worker
    the end of queue.

    10) Joining the results-saving worker.

    The progress reporting worker quits automatically with the main (parent) process because it is of daemon type.
    Therefore there is no need to join it.
    :param path_to_imgs_pickle: path to pickle file containing dictionary of Instagram Image IDs as keys and corresponding lists of urls as values
    :param test_run: amount of Instagram Images IDs to classify; None or 0 if all
    :param download_process_no: number of download processes
    :param classification_process_no: number of classification processes
    :param output_file: filename of the results file to be written to
    :param progress_report_interval: interval how often to write the current progress into stdout in seconds; if None or 0 => no progress reporting
    :return: None
    """
    ctx = multiprocessing.get_context('spawn')

    maybe_download_and_extract()

    with open(path_to_imgs_pickle, 'rb') as handle:
        urldict = pickle.load(handle)

    imgid_urls_queue = ctx.Queue()
    i = 0
    for img_id, url_list in urldict.items():
        i += 1
        imgid_urls_queue.put((img_id, url_list))
        if test_run and i >= test_run:
            break
    imgid_urls_queue_orig_len = imgid_urls_queue.qsize()

    img_filename_queue = ctx.Queue()

    download_p_list = []
    for i in range(1, download_process_no + 1):
        download_p = ctx.Process(target=download_worker, args=(imgid_urls_queue, img_filename_queue,))
        download_p_list.append(download_p)
        imgid_urls_queue.put((None, None))
        download_p.start()

    result_queue = ctx.Queue()

    classification_p_list = []
    for i in range(1, classification_process_no + 1):
        classification_p = ctx.Process(target=classification_worker, args=(img_filename_queue, result_queue,))
        classification_p_list.append(classification_p)
        classification_p.start()

    save_results_p = ctx.Process(target=save_result_worker, args=(result_queue, output_file,))
    save_results_p.start()

    if progress_report_interval:
        progress_reporting_p = ctx.Process(target=progress_reporting_worker, args=(imgid_urls_queue, imgid_urls_queue_orig_len, img_filename_queue, result_queue, progress_report_interval,))
        progress_reporting_p.daemon = True
        progress_reporting_p.start()

    for p in download_p_list:
        p.join()

    print('Download done!', flush=True)

    for p in classification_p_list:
        img_filename_queue.put(None)

    for p in classification_p_list:
        p.join()

    print('Classification done!', flush=True)

    result_queue.put(None)

    save_results_p.join()

    print('Done!')


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', '--path_to_imgs_pickle', type=str, required=True)
    parser.add_argument('-o', '--output_json_file', type=str)
    parser.add_argument('-dp', '--download_process_count', type=int, default=2)
    parser.add_argument('-cp', '--classification_process_count', type=int, default=3)
    parser.add_argument('-t', '--test_run', type=int, default=0, help='amount of images to process, None or 0 if all')
    parser.add_argument('-ri', '--progress_report_interval', type=int, default=5, help='in seconds, if None or 0 => no progress reporting')
    args = parser.parse_args()
    if args.output_json_file:
        output_file = args.output_json_file
    else:
        output_file = 'result' + time.strftime("%Y%m%d-%H%M%S") + '.json'

    main(args.path_to_imgs_pickle, args.test_run, args.download_process_count, args.classification_process_count, output_file, args.progress_report_interval)


Now, we will run the script to download & classify the first 100 images from september-october_urls.pickle, which we created in the previous step.

IMPORTANT: The next cell will probably fail because interactive notebooks and multiprocessing do not get along well (at least on Windows). In that case you will have to run the command from the shell by calling `python3 pipeline.py -p months/september-october_urls.pickle -t 100 --download_process_count 2 --classification_process_count 4 -o results/september-october_urls_result.json >> log.txt &` (again assuming you are in the directory where the script is located). This runs the script in the background, so if you want to see live updates of the log, run `less +F log.txt`.

In [None]:
%run ./pipeline.py -p months/september-october_urls.pickle -t 100 --download_process_count 2 --classification_process_count 4 -o results/september-october_urls_result.json

After the script finishes you should see that results/september-october_urls_result.json got created. We will use it in the next step.

##### Pipeline - 2. Parallelized Download & Classification - Speed considerations

We ran the computation on a laptop and on 3 different Amazon AWS EC2 instances. The reason for this was an attempt to get results as fast as possible. However, a nice side effect is that we are able to analyse the speed improvements due to parallelization.

  Machine | CPU cores | Internet connection | # download workers | # classification workers | # images processed / second
  ------------- | ------------- | ------------- | ------------- | ------------- | -------------
  Laptop Dell Latitude E7450 | 4 (threads) | 30 Mbps | 1 | 3 | ~1
  c4.8xlarge | 36 | 10 Gbps | 8 | 24 | ~20
  m4.16xlarge | 64 | 20 Gbps | 12 | 50 | ~25

We can see a 20-fold performance increase when switching from the laptop to the Amazon AWS EC2 c4.8xlarge server, while the number of cores increased only 9 times. Thus, we concluded that the number of cores is not the only factor behind the speed-up. We believe that the much faster network connection (10 Gbps versus 30 Mbps, according to the table above) also played a big role in this 20-fold increase.
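
The arithmetic behind this comparison can be made explicit. The sketch below only recomputes ratios from the figures in the table above; the throughput values are the approximate ones reported there, and the helper name `compare_to_baseline` is an assumption.

```python
# Figures copied from the table above:
# (CPU cores, classification workers, approximate images per second).
machines = {
    "laptop": (4, 3, 1.0),
    "c4.8xlarge": (36, 24, 20.0),
    "m4.16xlarge": (64, 50, 25.0),
}


def compare_to_baseline(machines, baseline="laptop"):
    """Return {name: (speedup, core_ratio, rate_per_classification_worker)}."""
    base_cores, _, base_rate = machines[baseline]
    summary = {}
    for name, (cores, workers, rate) in machines.items():
        summary[name] = (rate / base_rate, cores / base_cores, rate / workers)
    return summary


summary = compare_to_baseline(machines)
for name, (speedup, core_ratio, per_worker) in summary.items():
    print("%-12s speedup x%.1f, cores x%.1f, %.2f images/s per classification worker"
          % (name, speedup, core_ratio, per_worker))
```

The per-worker rate is a rough indicator only, since it ignores differences in CPU speed and in how well the download workers kept the classification workers busy.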

#### Pipeline - 3. Class / Object extraction
In this step we will extract all the different classes / objects that were found in our images. We will create a list of them in descending order by the sum of confidences. This will help us in the following step.

To do so we will run [extract_classes.py](https://github.com/korcek-juraj/epfl-ada16-project/blob/master/Instagram-Classification/extract_classes.py) script.

In [None]:
# %load ./extract_classes.py
import argparse
import json
import errno
import os


def extract_classes(results_filename, classes_filename):
    """
    Reads results JSON file line by line and extracts classes / objects found in every image, i.e., it does groupby by
    the class name while aggregating the count for the class and summing over the confidence of the particular class prediction.

    The output file is in form of JSON list of tuples (class name, {'count': count, 'score': score, 'weight':0}) sorted by
    highest score (sum of confidence). Weight stands for sentiment of the particular class. It is set to 0 and is supposed
    to be manually changed to values between -1 and 1, before the file can be used for sentiment calculation using
    calculate_sentiment.py script.
    :param results_filename: filename of the results JSON file to extract classes from
    :param classes_filename: filename of a JSON file where the extracted classes will be written
    :return: None
    """
    classes_dict = {}
    img_id_cache = set()
    with open(results_filename) as results_file:
        for line in results_file:
            result_dict = json.loads(line)
            img_id = next(iter(result_dict))
            if img_id not in img_id_cache:
                img_id_cache.add(img_id)
                for cls, score in result_dict[img_id].items():
                    if cls not in classes_dict:
                        classes_dict[cls] = {'weight': 0, 'score': 0, 'count': 0}
                    classes_dict[cls]['score'] += score
                    classes_dict[cls]['count'] += 1

    dir_path = os.path.dirname(classes_filename)
    # os.makedirs('') raises an error, so only create the directory when the
    # output path actually contains one.
    if dir_path:
        os.makedirs(dir_path, exist_ok=True)
    with open(classes_filename, 'wt') as classes_file:
        json.dump(sorted(list(classes_dict.items()), key=lambda x: x[1]['score'], reverse=True), classes_file, sort_keys=True, indent=4, separators=(',', ': '))

    print('No. of classses: ' + str(len(classes_dict)))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-r', '--path_to_results_file', type=str, required=True)
    parser.add_argument('-c', '--output_classes_file', type=str, required=True)
    args = parser.parse_args()
    extract_classes(args.path_to_results_file, args.output_classes_file)




We just need to pass it the file obtained in previous step and a path to filename to be created. Again, to run it from shell you need to execute `python3 extract_classes.py -r results/september-october_urls_result.json  -c results/september-october_urls_classes.json`.


In [28]:
%run ./extract_classes.py -r results/september-october_urls_result.json  -c results/september-october_urls_classes.json

No. of classses: 999


The resulting file looks similar to this:

In [None]:
[
    [
        "alp",
        {
            "count": 26,
            "score": 10.534843739587814,
            "weight": 0
        }
    ],
    [
        "lakeside, lakeshore",
        {
            "count": 33,
            "score": 5.4354036473087035,
            "weight": 0
        }
    ],
    [
        "valley, vale",
        {
            "count": 27,
            "score": 3.9698514562333003,
            "weight": 0
        }
    ],
    [
        "sunglasses, dark glasses, shades",
        {
            "count": 8,
            "score": 2.3758638575673103,
            "weight": 0
        }
    ],
    [
        "fountain",
        {
            "count": 4,
            "score": 1.854876298457384,
            "weight": 0
        }
    ],
    [
        "sunglass",
        {
            "count": 7,
            "score": 1.7299124635756016,
            "weight": 0
        }
    ],
    [
        "stage",
        {
            "count": 2,
            "score": 1.5328002572059631,
            "weight": 0
        }
    ],
    [
        "cliff, drop, drop-off",
        {
            "count": 18,
            "score": 1.527658513165079,
            "weight": 0
        }
    ],
    [
        "plate",
        {
            "count": 2,
            "score": 1.3660185933113098,
            "weight": 0
        }
    ]
]

#### Pipeline - 4. Class / Object sentiment assignment
We will use this file to set the sentiment for particular objects. For example, to the mountain, valley, or lakeside classes we will assign positive sentiment (1), while to face or food classes we will assign neutral sentiment (0). Additionally, if there are some negative classes, e.g. trash or graffiti, we can assign negative sentiment (-1). The fact that the file was sorted in the previous step allows us to focus on the most frequently occurring objects / classes. For example, the resulting file might look like this:

In [None]:
[
    [
        "alp",
        {
            "count": 26,
            "score": 10.534843739587814,
            "weight": 1
        }
    ],
    [
        "lakeside, lakeshore",
        {
            "count": 33,
            "score": 5.4354036473087035,
            "weight": 1
        }
    ],
    [
        "valley, vale",
        {
            "count": 27,
            "score": 3.9698514562333003,
            "weight": 1
        }
    ],
    [
        "sunglasses, dark glasses, shades",
        {
            "count": 8,
            "score": 2.3758638575673103,
            "weight": 0
        }
    ],
    [
        "fountain",
        {
            "count": 4,
            "score": 1.854876298457384,
            "weight": 1
        }
    ],
    [
        "sunglass",
        {
            "count": 7,
            "score": 1.7299124635756016,
            "weight": 0
        }
    ],
    [
        "stage",
        {
            "count": 2,
            "score": 1.5328002572059631,
            "weight": 0
        }
    ],
    [
        "cliff, drop, drop-off",
        {
            "count": 18,
            "score": 1.527658513165079,
            "weight": 1
        }
    ],
    [
        "plate",
        {
            "count": 2,
            "score": 1.3660185933113098,
            "weight": 0
        }
    ]
]
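
Editing the weights by hand works well for a short list. For longer class files, the same assignment could be scripted. The following is only a minimal sketch and not part of the project's scripts: the helper name `assign_weights` and the lookup table `positive` are hypothetical, and the input is a shortened copy of the classes list shown above.

```python
import json


def assign_weights(classes_list, weights_by_class, default=0):
    """Return a copy of the classes list with 'weight' taken from a lookup table.

    Classes missing from the lookup table get the default weight (neutral).
    """
    updated = []
    for cls, attrs in classes_list:
        new_attrs = dict(attrs)  # copy, so the input list stays unchanged
        new_attrs["weight"] = weights_by_class.get(cls, default)
        updated.append([cls, new_attrs])
    return updated


# Shortened copy of the ordered classes list, all weights still 0
classes_list = [
    ["alp", {"count": 26, "score": 10.534843739587814, "weight": 0}],
    ["lakeside, lakeshore", {"count": 33, "score": 5.4354036473087035, "weight": 0}],
    ["plate", {"count": 2, "score": 1.3660185933113098, "weight": 0}],
]

# Hypothetical lookup table: classes we consider pleasant
positive = {"alp": 1, "lakeside, lakeshore": 1}

weighted = assign_weights(classes_list, positive)
print(json.dumps(weighted, indent=4))
```

Keeping the weights in a separate lookup table makes it possible to reuse them when the classes file is regenerated for a different set of images.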

#### Pipeline - 5. Sentiment calculation
Finally, we will calculate the sentiment using the results file from step 2 and the classes file from step 4. 

The final sentiment of an image is computed by multiplying the confidence of each of the 5 objects found in the image (stored in the results file) by the sentiment weight of that object defined in the classes file. The sum of these 5 products is the final sentiment as a decimal. Additionally, we compute a discrete sentiment from {-1, 0, 1} as follows:  
    -1 if the decimal sentiment is in [-1, -0.33)  
     0 if the decimal sentiment is in [-0.33, 0.33]  
     1 if the decimal sentiment is in (0.33, 1]  
     
This is done by the script [calculate_sentiment.py](https://github.com/korcek-juraj/epfl-ada16-project/blob/master/Instagram-Classification/calculate_sentiment.py). The resulting sentiment file is a JSON dictionary with Instagram Image IDs as keys and dictionaries holding the decimal and discrete sentiment as values. 
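
The calculation can be illustrated on a single image. This is a minimal sketch that is independent of `calculate_sentiment.py`: the function name `image_sentiment` and the confidence scores are hypothetical and chosen only for illustration, while the weights follow the example classes file from step 4.

```python
def image_sentiment(class_scores, class_weights, threshold=0.33):
    """Return (decimal sentiment, discrete sentiment) for one image.

    class_scores:  {class name: confidence} for the objects found in the image
    class_weights: {class name: sentiment weight from {-1, 0, 1}}
    """
    # Weighted sum of confidences; raises KeyError for a class without a weight
    sent_float = sum(score * class_weights[cls] for cls, score in class_scores.items())
    if sent_float > threshold:
        sent_int = 1
    elif sent_float < -threshold:
        sent_int = -1
    else:
        sent_int = 0
    return sent_float, sent_int


# Sentiment weights as assigned in the example classes file
class_weights = {"alp": 1, "valley, vale": 1, "sunglass": 0, "plate": 0, "stage": 0}

# Hypothetical confidences of the 5 objects found in one image
class_scores = {"alp": 0.5, "valley, vale": 0.25, "sunglass": 0.125,
                "plate": 0.0625, "stage": 0.03125}

print(image_sentiment(class_scores, class_weights))  # prints (0.75, 1)
```

Only the two positively weighted classes contribute here, so the decimal sentiment is 0.5 + 0.25 = 0.75, which falls into the positive bin.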

In [None]:
# %load ./calculate_sentiment.py
import argparse
import json
import errno
import os

import numpy as np


def calculate_sentiment(results_filename, classes_filename, sentiment_filename):
    """
    For every Image ID, it computes sentiment by multiplying confidence/score of classes/objects found in an image corresponding
    to the given Image ID from results JSON file (passed in parameter results_filename) with sentiment weight for particular
    class defined in classes JSON file (passed in classes_filename parameter) and summing over these products.

    The output file is in form of JSON dictionary with Image IDs being keys and dictionaries {'sent_float': float, 'sent_int': int from {-1, 0, 1}}
    being values. 'sent_int' is just 'sent_float' discretized into 3 bins:
    -1 if 'sent_float' is in [-1, -0.33)
     0 if 'sent_float' is in [-0.33, 0.33]
     1 if 'sent_float' is in (0.33, 1]
    :param results_filename: filename of the results JSON file
    :param classes_filename: filename of the JSON file extracted with extract_classes.py script and manually changed to fit the task
    :param sentiment_filename: filename of an output JSON file containing dictionary of Instagram Image IDs and their sentiment
    :return: None
    """
    sentiment_dict = {}

    with open(classes_filename) as classes_file:
        classes_list = json.load(classes_file)
        classes_dict = {cls: attr_dict['weight'] for cls, attr_dict in classes_list}

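    with open(results_filename) as results_file:
        # The results file holds one JSON object per line: {image_id: {class: score, ...}}
        for line in results_file:
            result_dict = json.loads(line)
            img_id = next(iter(result_dict))
            # Process each Image ID only once; later duplicates are skipped
            if img_id not in sentiment_dict:
                sentiment_dict[img_id] = {'sent_int': 0, 'sent_float': 0}
                # Weighted sum of the class scores found in this image
                for cls, score in result_dict[img_id].items():
                    sentiment_dict[img_id]['sent_float'] += score * classes_dict[cls]
                # Discretize: sign of the sum if its magnitude exceeds 0.33, otherwise 0 (np.sign returns a float)
                sentiment_dict[img_id]['sent_int'] = (1 if abs(sentiment_dict[img_id]['sent_float']) > 0.33 else 0) * np.sign(sentiment_dict[img_id]['sent_float'])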

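    # Create the output directory if needed; skip this when the output path has no directory part,
    # because os.makedirs('') would raise an error
    dir_path = os.path.dirname(sentiment_filename)
    if dir_path:
        try:
            os.makedirs(dir_path)
        except OSError as exc:  # Python >2.5
            if exc.errno == errno.EEXIST and os.path.isdir(dir_path):
                pass
            else:
                raise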
    with open(sentiment_filename, 'wt') as sentiment_file:
        json.dump(sentiment_dict, sentiment_file, sort_keys=True, indent=4, separators=(',', ': '))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-r', '--path_to_results_file', type=str, required=True)
    parser.add_argument('-c', '--path_to_classes_file', type=str, required=True)
    parser.add_argument('-o', '--output_sentiment_file', type=str, required=True)
    args = parser.parse_args()
    calculate_sentiment(args.path_to_results_file, args.path_to_classes_file, args.output_sentiment_file)


Now, we will run the script on the files obtained in steps 2 and 4 to get the final sentiments. If you are in a shell, run `python3 calculate_sentiment.py -r results/september-october_urls_result.json -c results/september-october_urls_classes.json -o results/september-october_urls_sentiment.json` instead of the following cell.

In [32]:
%run ./calculate_sentiment.py -r results/september-october_urls_result.json -c results/september-october_urls_classes.json -o results/september-october_urls_sentiment.json

The resulting sentiment file `results/september-october_urls_sentiment.json` should look similar to this:

In [None]:
{
    "BJ-YG36jyVm": {
        "sent_float": 0.0,
        "sent_int": 0.0
    },
    "BJ04ACTAymo": {
        "sent_float": 0.613392760977149,
        "sent_int": 1.0
    },
    "BJ05hBchWmP": {
        "sent_float": 0.0,
        "sent_int": 0.0
    },
    "BJ0Pc3CBq5H": {
        "sent_float": 0.5774869928136468,
        "sent_int": 1.0
    },
    "BJ0Zf51BapC": {
        "sent_float": 0.0,
        "sent_int": 0.0
    },
    "BJ0amnNhpCu": {
        "sent_float": 0.8761984705924988,
        "sent_int": 1.0
    },
    "BJ0o9YtBgf4": {
        "sent_float": 0.30901558231562376,
        "sent_int": 0.0
    },
    "BJ0yLsrg1K-": {
        "sent_float": 0.0,
        "sent_int": 0.0
    },
    "BJ1AUYmjyZj": {
        "sent_float": 0.9857816100120544,
        "sent_int": 1.0
    },
    "BJ1bUPfAnB_": {
        "sent_float": 0.0,
        "sent_int": 0.0
    },
    "BJ2Je6BjwQE": {
        "sent_float": 0.0,
        "sent_int": 0.0
    },
    "BJ2JujdA8QL": {
        "sent_float": 0.0,
        "sent_int": 0.0
    },
    "BJ2OQ2SBE_C": {
        "sent_float": 0.8790866502095014,
        "sent_int": 1.0
    },
    "BJ2tSflAgbZ": {
        "sent_float": 0.13510681688785553,
        "sent_int": 0.0
    }
}
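
Once the sentiment file exists, a quick summary can serve as a sanity check before visualization. The following is a minimal sketch, not part of the project's scripts: the helper name `summarize_sentiment` is hypothetical, and the dictionary is a three-entry excerpt of the output shown above rather than the full file.

```python
from collections import Counter


def summarize_sentiment(sentiment_dict):
    """Return (count of images per discrete sentiment, mean decimal sentiment)."""
    # 'sent_int' is stored as a float in the file (e.g. 1.0), so cast it to int
    counts = Counter(int(entry["sent_int"]) for entry in sentiment_dict.values())
    if sentiment_dict:
        mean = sum(entry["sent_float"] for entry in sentiment_dict.values()) / len(sentiment_dict)
    else:
        mean = 0.0
    return counts, mean


# Excerpt of the sentiment file shown above; in practice it would be read with json.load
sentiment_dict = {
    "BJ-YG36jyVm": {"sent_float": 0.0, "sent_int": 0.0},
    "BJ04ACTAymo": {"sent_float": 0.613392760977149, "sent_int": 1.0},
    "BJ0o9YtBgf4": {"sent_float": 0.30901558231562376, "sent_int": 0.0},
}

counts, mean = summarize_sentiment(sentiment_dict)
print(dict(counts), mean)
```

In this excerpt, two images are neutral and one is positive. A file in which every image is neutral would suggest that few classes received a non-zero weight in step 4.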

### Visualization
The next step is the visualization of sentiment throughout Switzerland. For more details, see the [/GeoVis/](https://github.com/korcek-juraj/epfl-ada16-project/tree/master/GeoVis) folder.

### Useful bash commands
`ssh -i amazonPublicKey.pem ec2-user@ec2-34-249-163-22.eu-west-1.compute.amazonaws.com`  
`scp -i amazonPublicKey.pem  pipeline.py ec2-user@ec2-34-249-163-22.eu-west-1.compute.amazonaws.com:~/.`  
`less +F filename`  
`wc -l filename`  
`ps aux | grep python3`  
`pkill -f python3`

### Suggestions for improvement