Created on Mon Jan 6 09:58:13 2020
<br>
Group 7
<br>
@authors: All group members
<h1>Group 7 - Images sociales<span class="tocSkip"></span>
    
<br>  
<center>Part 2: ImagesStats

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Environment" data-toc-modified-id="Environment-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Environment</a></span><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#Parameters" data-toc-modified-id="Parameters-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Parameters</a></span></li><li><span><a href="#Functions" data-toc-modified-id="Functions-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Functions</a></span></li></ul></li><li><span><a href="#SeatGuru" data-toc-modified-id="SeatGuru-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>SeatGuru</a></span><ul class="toc-item"><li><span><a href="#Create-DataFrame" data-toc-modified-id="Create-DataFrame-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Create DataFrame</a></span></li><li><span><a href="#Manually-added-labellisation" data-toc-modified-id="Manually-added-labellisation-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Manually added labellisation</a></span></li><li><span><a href="#Descriptive-statistics" data-toc-modified-id="Descriptive-statistics-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Descriptive statistics</a></span></li></ul></li><li><span><a href="#Instagram" data-toc-modified-id="Instagram-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Instagram</a></span></li></ul></div>

# Introduction
In this notebook, you will find the code to compute basic statistics on the 2 sets of images given by Airbus for prediction. Our goal was to retrieve information about the amount of data, the image formats, and relevant labels when possible.

**Instagram**<span class="tocSkip"></span><br>
Output: a `CSV` containing information about the image formats: format, height, width, number of colours in the colour model (here, RGB). In addition, some basic statistics (number of images, number of different formats) are displayed.

**SeatGuru**<span class="tocSkip"></span><br>
Since the SeatGuru image file names already contain some labels, and since our goal was to make predictions on similar social media images, we decided to include these images in some of our train and test sets.
The following code therefore extracts the `aircraft_manufacturer` and `aircraft_type` labels from the image file names.
We also describe the method we used to add `View` labels manually. Since we have already performed this task, you can leave `manually_annotate_SeatGuru` set to `False` and simply run the following code blocks to retrieve the labels we provide.

Output: a `CSV` containing information regarding:
* Image formats;
* Aircraft manufacturer and type;
* Viewpoint: interior (Int), exterior (Ext), exterior viewed from a window (Ext_Int), meal tray (Meal).

In addition, some basic statistics (number of images, number of different formats) are displayed, along with the conclusions we draw for deep learning models: we will see that in many cases, data augmentation or additional image sources will be necessary.

# Environment
This notebook requires `python 3.6` or later.
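
The version requirement can be checked before running anything else. The block below is only a sketch: the helper name `python_is_recent_enough` and the constant `MIN_VERSION` are ours, not part of the project.

```python
import sys

# f-strings, used throughout this notebook, were introduced in Python 3.6.
MIN_VERSION = (3, 6)  # assumption: taken from the requirement stated above


def python_is_recent_enough(version=None, minimum=MIN_VERSION):
    """Return True if `version` (default: the running interpreter) >= `minimum`."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(minimum)


if not python_is_recent_enough():
    raise RuntimeError('This notebook needs Python 3.6 or later.')
```
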
## Libraries

In [1]:
import os
import numpy as np
import pandas as pd
from matplotlib.pyplot import imread
from PIL import Image

In [2]:
%load_ext watermark
%watermark -p numpy,pandas,matplotlib,PIL

numpy 1.17.4
pandas 0.25.2
matplotlib 3.1.1
PIL 6.2.0


## Parameters

In [3]:
project_path = './../'
seatguru_path = project_path + 'Interpromo2020/All Data/ANALYSE IMAGE/IMG SEATGURU/'
insta_path = project_path + 'Interpromo2020/All Data/ANALYSE IMAGE/INSTAGRAM/'
stats_path = project_path + 'ImagesStats/'
annot_path = project_path + 'IMG_annot/SEATGURU/'

# True if you want to add newly labelled images
manually_annotate_SeatGuru = False
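
Since every later cell depends on these folders, it may help to verify that they exist first. This is a minimal sketch using only the standard library; `missing_dirs` is a hypothetical helper, not a function of this project.

```python
import os


def missing_dirs(paths):
    """Return the entries of `paths` that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]


# Hypothetical usage with the parameters defined above
# (stats_path is created on demand by the notebook, so it is not checked):
# missing = missing_dirs([seatguru_path, insta_path])
# if missing:
#     print(f'Missing input folders: {missing}')
```
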

## Functions

In [4]:
def get_imgs_formats(imgs: list, imgs_df: pd.DataFrame) -> pd.DataFrame:
    """Fills the DataFrame with pictures info (format and dimensions).

    Parameters:
    imgs: list of arrays representing images
    imgs_df: empty DataFrame with at least a 'name' column

    Out:
    imgs_df: DataFrame with 5 filled columns: 'format', 'height', 'width',
             'height_to_width', 'ncol', and one line per image

    """

    # Formats (text after the last dot, so names containing dots still work)
    imgs_df['format'] = imgs_df.name.apply(
        lambda x: os.path.splitext(x)[1].lstrip('.'))

    # Shapes: unreadable images (empty shape) get 0 everywhere;
    # greyscale images (2-D arrays) count as 1 channel
    heights = [img.shape[0] if len(img.shape) != 0 else 0 for img in imgs]
    widths = [img.shape[1] if len(img.shape) != 0 else 0 for img in imgs]
    ncols = [img.shape[2] if len(img.shape) > 2
             else 1 if len(img.shape) == 2 else 0 for img in imgs]

    imgs_df['height'] = heights
    imgs_df['width'] = widths
    imgs_df['height_to_width'] = [x / y if y != 0 else 0 for x,
                                  y in zip(imgs_df.height, imgs_df.width)]
    imgs_df['ncol'] = ncols

    return imgs_df


def get_stats(imgs_df: pd.DataFrame, col: pd.Series, col_name: str):
    """Basic statistics on a given DataFrame column."""

    print(f'Max {col_name}: {round(np.max(col), 4)}')
    print(f'Median {col_name}: {round(np.median(col), 4)}')
    print(f'Min {col_name}: {round(np.min(col), 4)}')
    print('')


def data_insta(insta_path: str, hashtags_list: list, stats_path: str) -> pd.DataFrame:
    """
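    Builds the images info DataFrame for all hashtags and saves it as CSV.

    Parameters:
        insta_path: path to Instagram folders
        hashtags_list: chosen hashtags (a folder has to be named after each hashtag)
        stats_path: folder in which 'g7_INSTAGRAM.csv' is saved

    Out:
        df_all: DataFrame with one line per image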
    """

    df_all = pd.DataFrame(
        columns=['hashtag', 'name', 'format', 'height', 'width', 'height_to_width', 'ncol'])

    for h in hashtags_list:
        img_list = os.listdir(insta_path + '/' + str(h))
        img_list = [img for img in img_list if '.xlsx' not in img]

        # List of all matrices (images)
        imgs = list()
        for img in img_list:
            img = imread(insta_path + '/' + str(h) + '/' + img)
            imgs.append(img)

        # Init the dataframe that will contain all basic info about images
        imgs_df = pd.DataFrame(
            columns=['hashtag', 'name', 'format', 'height', 'width', 'height_to_width', 'ncol'])
        imgs_df.name = img_list
        imgs_df.hashtag = h

        # Apply function to fill in DataFrame
        imgs_df = get_imgs_formats(imgs, imgs_df)

        df_all = pd.concat([df_all, imgs_df], axis=0)

    # Save as csv
    os.makedirs(stats_path, exist_ok=True)
    df_all.to_csv(path_or_buf=stats_path + 'g7_INSTAGRAM.csv',
                  sep=';', encoding='utf-8', index=False)

    return df_all
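
To make the meaning of the `height`, `width`, `height_to_width` and `ncol` columns concrete, here is a small sketch that derives them from an array shape alone. `describe_shape` is a hypothetical helper, and counting a greyscale image as 1 channel is our assumption rather than something the notebook states.

```python
def describe_shape(shape):
    """Return (height, width, height_to_width, ncol) for an image array shape.

    Assumed conventions:
    - ()        -> unreadable image, everything is 0 (as in the notebook);
    - (h, w)    -> greyscale image, counted as 1 channel (our assumption);
    - (h, w, c) -> colour image with c channels.
    """
    if len(shape) < 2:
        return 0, 0, 0, 0
    height, width = shape[0], shape[1]
    ncol = shape[2] if len(shape) > 2 else 1
    ratio = height / width if width != 0 else 0
    return height, width, ratio, ncol
```

For instance, a 720 x 960 RGB image has shape `(720, 960, 3)`, which gives a `height_to_width` of 0.75.
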

# SeatGuru

In [5]:
img_list = os.listdir(seatguru_path)
nb_images = len(img_list)
print(f'{nb_images} SeatGuru images')

2556 SeatGuru images


## Create DataFrame

In [6]:
# List of all matrices (images)
seatguru_imgs = list()
for i in range(nb_images):
    img = imread(seatguru_path + img_list[i])
    seatguru_imgs.append(img)

# Init the dataframe that will contain all basic info about images
seatguru_df = pd.DataFrame(columns=['name', 'format', 'height', 'width',
                                    'height_to_width', 'ncol', 'aircraft_manufacturer', 'aircraft_type'])
seatguru_df.name = img_list

# Apply function to fill in DataFrame
seatguru_df = get_imgs_formats(seatguru_imgs, seatguru_df)

# Aircraft manufacturers
aircraft_manufacturers = ['Airbus' if 'Airbus' in seatguru_df.name[k]
                          else 'Boeing' if 'Boeing' in seatguru_df.name[k]
                          else 'Other' for k in range(len(seatguru_df))]

# Aircraft types
aircraft_types = [name.split('_')[name.split('_').index(aircraft_manufacturer) + 1].split('-')[0]
                  if aircraft_manufacturer in name.split('_') else ''
                  for name, aircraft_manufacturer in zip(seatguru_df.name, aircraft_manufacturers)]

# Add the missing 'A' prefix for Airbus aircraft
aircraft_types = [aircraft_types[k]
                  if ('A' in aircraft_types[k] or '7' in aircraft_types[k] or aircraft_types[k] == '')
                  else 'A' + aircraft_types[k] for k in range(len(aircraft_types))]

# Remove Airbus 'neos' because we don't need that much detail
aircraft_types = [aircraft_types[k].replace(
    'neo', '') for k in range(len(aircraft_types))]

# Remove Ds
aircraft_types = [aircraft_types[k].replace(
    'D', '') for k in range(len(aircraft_types))]

# Fill in dataframe columns
seatguru_df.aircraft_manufacturer = aircraft_manufacturers
seatguru_df.aircraft_type = aircraft_types

In [7]:
seatguru_df.head()

Unnamed: 0,name,format,height,width,height_to_width,ncol,aircraft_manufacturer,aircraft_type
0,Aegean_Airlines_Airbus_A320-200_0.jpg,jpg,720,720,1.0,3,Airbus,A320
1,Aegean_Airlines_Airbus_A320-200_1.jpg,jpg,720,720,1.0,3,Airbus,A320
2,Aegean_Airlines_Airbus_A320-200_2.jpg,jpg,720,720,1.0,3,Airbus,A320
3,Aegean_Airlines_Airbus_A320-200_3.jpg,jpg,720,960,0.75,3,Airbus,A320
4,Aegean_Airlines_Airbus_A320-200_4.jpg,jpg,720,540,1.333333,3,Airbus,A320
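
The label extraction above relies on the file name layout `<airline>_<manufacturer>_<type>-<variant>_<index>.jpg`. The sketch below restates the core idea as a single function. It is a simplification, not the notebook's exact logic: `parse_aircraft` and `MANUFACTURERS` are hypothetical names, the manufacturer is matched as a whole token, and the clean-up steps (adding a missing `A`, removing `neo` and `D`) are left out.

```python
MANUFACTURERS = ('Airbus', 'Boeing')


def parse_aircraft(name):
    """Return (manufacturer, aircraft_type) guessed from a SeatGuru file name."""
    # Drop the extension, then split the name into its '_'-separated tokens.
    tokens = name.rsplit('.', 1)[0].split('_')
    for manufacturer in MANUFACTURERS:
        if manufacturer in tokens:
            position = tokens.index(manufacturer)
            if position + 1 >= len(tokens):
                return manufacturer, ''
            # Keep the family only: 'A320-200' -> 'A320'
            aircraft_type = tokens[position + 1].split('-')[0]
            return manufacturer, aircraft_type
    return 'Other', ''
```

With the first file shown above, `parse_aircraft('Aegean_Airlines_Airbus_A320-200_0.jpg')` gives `('Airbus', 'A320')`.
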


## Manually added labellisation
The first model of our final pipeline (see Part 3. of `README.md`) classifies images into 4 View categories: interior (Int), exterior (Ext), exterior viewed from a window (Ext_Int), meal tray (Meal).
These labels were not provided with our images. We added them manually because separating at least Exterior from Interior images is a prerequisite for applying the models designed specifically for each of these 2 categories.

This is the method we used for manual labelling:
* Make sure you have 4 folders in `./../IMG_annot/SEATGURU/` named `Ext`, `Int`, `Ext_Int`, and `Meal`;
* Go through the `IMG SEATGURU` folder and copy each relevant image to the folder above that corresponds to the label you want to give it;
* Set `manually_annotate_SeatGuru` to `True` and run the following code blocks, which create a DataFrame with all 4 labels, merge it with the DataFrame created in Part 3.1 of this notebook, and save the result to `CSV`.

Since we have already performed this task, you can leave `manually_annotate_SeatGuru` set to `False` and simply run the following code blocks to retrieve the labels we provide.
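
The folder-based method above can be sketched with the standard library only. The helper names `ensure_label_folders` and `collect_labels` are hypothetical, and the demonstration uses a temporary directory with an empty placeholder file instead of real images.

```python
import os
import tempfile

VIEW_LABELS = ('Ext', 'Int', 'Ext_Int', 'Meal')


def ensure_label_folders(annot_path, labels=VIEW_LABELS):
    """Create one sub-folder per view label if it does not exist yet."""
    for label in labels:
        os.makedirs(os.path.join(annot_path, label), exist_ok=True)


def collect_labels(annot_path, labels=VIEW_LABELS):
    """Return sorted (image name, view label) pairs read from the label folders."""
    rows = []
    for label in labels:
        for name in os.listdir(os.path.join(annot_path, label)):
            rows.append((name, label))
    return sorted(rows)


# Small demonstration in a temporary directory (no real images involved).
with tempfile.TemporaryDirectory() as tmp:
    ensure_label_folders(tmp)
    open(os.path.join(tmp, 'Int', 'example_0.jpg'), 'w').close()
    demo_rows = collect_labels(tmp)
```

Each pair corresponds to one row of the `name` / `view` DataFrame built in the next cell.
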

In [8]:
if manually_annotate_SeatGuru:
    # Add your new labels
    labels = os.listdir(annot_path)  # ['Int', 'Ext', 'Ext_Int', 'Meal']

    # Create and fill labels DataFrame
    labelled_df = pd.DataFrame(columns=['name', 'view'])

    for label in labels:
        label_path = annot_path + label
        img_list = os.listdir(label_path)
        df = pd.DataFrame(columns=['name', 'view'])
        df['name'] = img_list
        df['view'] = label
        labelled_df = pd.concat([labelled_df, df], axis=0)

else:
    # Simply get the annotated CSV we provide you
    labelled_df = pd.read_csv(stats_path + 'g7_SEATGURU.csv', sep=';')
    labelled_df = labelled_df[['name', 'view']]

In [9]:
# Merge with SeatGuru DataFrame from Part 3.1 of this notebook
seatguru_df = seatguru_df.merge(labelled_df, on='name', how='outer')

# Sort
seatguru_df = seatguru_df.sort_values(by=['aircraft_manufacturer', 'aircraft_type', 'view'],
                                      ascending=[True, True, True])

seatguru_df = seatguru_df.reindex(pd.Index(['aircraft_manufacturer', 'aircraft_type',
                                            'view', 'name', 'format', 'height', 'width',
                                            'height_to_width', 'ncol']), axis=1).reset_index(drop=True)

seatguru_df.head()

Unnamed: 0,aircraft_manufacturer,aircraft_type,view,name,format,height,width,height_to_width,ncol
0,Airbus,A220,Int,Delta_Airlines_DL_Airbus_A220-100_0.jpg,jpg,720,960,0.75,3
1,Airbus,A310,Ext,Air_Transat_Airbus_A310-300_1.jpg,jpg,640,960,0.666667,3
2,Airbus,A310,Int,Air_Transat_Airbus_A310-300_0.jpg,jpg,717,960,0.746875,3
3,Airbus,A318,Ext,Air_France_Airbus_A318_A_0.jpg,jpg,717,960,0.746875,3
4,Airbus,A318,Ext,Air_France_Airbus_A318_A_1.jpg,jpg,720,720,1.0,3
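
Because the merge uses `how='outer'`, names present on only one side are kept and their missing columns are filled with `NaN`. A quick way to see which names are affected is to compare the two sets of names before merging. This is only a sketch, and `unmatched_names` is a hypothetical helper.

```python
def unmatched_names(image_names, labelled_names):
    """Split names into those lacking a label and those lacking an image."""
    images, labelled = set(image_names), set(labelled_names)
    without_label = sorted(images - labelled)
    without_image = sorted(labelled - images)
    return without_label, without_image


# Hypothetical usage, to be run before the merge:
# no_label, no_image = unmatched_names(img_list, labelled_df.name)
```
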


In [10]:
# Save as csv
os.makedirs(stats_path, exist_ok=True)
seatguru_df.to_csv(path_or_buf=stats_path + 'g7_SEATGURU.csv',
                   sep=';', encoding='utf-8', index=False)

## Descriptive statistics

In [11]:
print(f'{len(np.unique(seatguru_df.format))} unique image format(s).')
print(f'{len(np.unique(seatguru_df.ncol))} unique ncol(s).')
print(f'{len(np.unique(seatguru_df.height_to_width))} unique height_to_width(s).')
print(f'{len(np.unique(seatguru_df.aircraft_type))} unique aircraft type(s).')

1 unique image format(s).
1 unique ncol(s).
128 unique height_to_width(s).
20 unique aircraft type(s).


In [12]:
# Number of images per aircraft manufacturer
# (summing the boolean mask works even if the merge changed the number of rows)
n_airbus = (seatguru_df.aircraft_manufacturer == 'Airbus').sum()
n_boeing = (seatguru_df.aircraft_manufacturer == 'Boeing').sum()
n_other = (seatguru_df.aircraft_manufacturer == 'Other').sum()
print(f'{n_airbus} Airbus labelled images.')
print(f'{n_boeing} Boeing labelled images.')
print(f'{n_other} others.')

1043 Airbus labelled images.
1112 Boeing labelled images.
401 others.


In [13]:
# Drop the 'Other' manufacturer and the 'Others' view because we won't use them to train our models
seatguru_df = seatguru_df[seatguru_df.aircraft_manufacturer != 'Other']
seatguru_df = seatguru_df[seatguru_df['view'] != 'Others']

In [14]:
# Dimensions
get_stats(imgs_df=seatguru_df, col=seatguru_df.height_to_width,
          col_name='height_to_width')

get_stats(imgs_df=seatguru_df, col=seatguru_df.height,
          col_name='height')

get_stats(imgs_df=seatguru_df, col=seatguru_df.width,
          col_name='width')

Max height_to_width: 2.1365
Median height_to_width: 1.0
Min height_to_width: 0.276

Max height: 720
Median height: 720.0
Min height: 265

Max width: 960
Median width: 720.0
Min width: 337



In [15]:
# Number of images per aircraft manufacturer and type
pvt = pd.DataFrame(pd.pivot_table(seatguru_df,
                                  index=['aircraft_manufacturer',
                                         'aircraft_type'],
                                  aggfunc='count').format.sort_values(ascending=False))

pvt.sort_values(by=['aircraft_manufacturer', 'format'],
                ascending=[True, False])

aircraft_manufacturer,aircraft_type,format
Airbus,A330,324
Airbus,A320,238
Airbus,A321,148
Airbus,A319,105
Airbus,A380,70
Airbus,A340,63
Airbus,A350,46
Airbus,A318,13
Airbus,A346,5
Airbus,A310,2


In [16]:
# Number of images per aircraft manufacturer and viewpoint
pvt = pd.DataFrame(pd.pivot_table(seatguru_df,
                                  index=['view'],
                                  columns=['aircraft_manufacturer'],
                                  aggfunc='count', margins=True).format)
pvt

view,Airbus,Boeing,All
Ext,96,96,192
Ext_Int,79,64,143
Int,766,839,1605
Meal,74,86,160
All,1015,1085,2100


We can draw the following conclusions from these SeatGuru statistics:
* We don't have enough Exterior images (192 `Ext` images in total), which is why we will use images scraped from Airliners and Google Images to train our `Ext_man` model;
* The images are unevenly distributed between our target categories, and some categories contain very few images, e.g. 342 images of Boeing 777 vs. 55 of Boeing 757. For Airbus interiors, we have the Hackathon images at our disposal; for Boeing, we will have to use data augmentation if our model underperforms.
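
To quantify the imbalance between views, the counts from the `All` column of the view table above can be turned into shares. This is a small illustrative calculation, not part of the original pipeline; the variable names are ours.

```python
# Counts copied from the 'All' column of the view table above.
view_counts = {'Ext': 192, 'Ext_Int': 143, 'Int': 1605, 'Meal': 160}

total = sum(view_counts.values())
shares = {view: count / total for view, count in view_counts.items()}
imbalance_ratio = max(view_counts.values()) / min(view_counts.values())

# Print the views from most to least frequent.
for view, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f'{view}: {share:.1%}')
print(f'Largest / smallest class: {imbalance_ratio:.1f}')
```

Interior images make up roughly three quarters of the labelled set, so a model that always predicts `Int` would already look deceptively accurate.
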

# Instagram

In [17]:
hashtags_list = os.listdir(insta_path)
nb_hashtags = len(hashtags_list)
print(f'{nb_hashtags} Instagram folders (1 folder per hashtag): \n{hashtags_list}')

4 Instagram folders (1 folder per hashtag): 
['airbus', 'aircraftinterior', 'aircraftseat', 'boeing']


In [18]:
insta_df = data_insta(insta_path=insta_path,
                      hashtags_list=hashtags_list, stats_path=stats_path)

In [19]:
insta_df.head()

Unnamed: 0,hashtag,name,format,height,width,height_to_width,ncol
0,airbus,0.jpg,jpg,512,512,1.0,3
1,airbus,1.jpg,jpg,512,768,0.666667,3
2,airbus,10.jpg,jpg,943,1080,0.873148,3
3,airbus,100.jpg,jpg,757,1080,0.700926,3
4,airbus,1000.jpg,jpg,899,1024,0.87793,3


In [20]:
print(f'{len(insta_df)} images:')

for h in hashtags_list:
    print(f'#{h}: {len(insta_df.hashtag[insta_df.hashtag == h])} images')

print(f'\n{len(np.unique(insta_df.format))} unique image format(s).')
print(f'{len(np.unique(insta_df.ncol))} unique ncol(s).')
print(f'{len(np.unique(insta_df.height_to_width))} unique height_to_width(s). \n')

5820 images:
#airbus: 1976 images
#aircraftinterior: 767 images
#aircraftseat: 1143 images
#boeing: 1934 images

1 unique image format(s).
2 unique ncol(s).
705 unique height_to_width(s). 
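
Knowing that there are 2 distinct `ncol` values does not say how many images carry each one. The sketch below counts the values of a column directly from semicolon-separated text such as the saved `g7_INSTAGRAM.csv`. The function name `count_column_values` is hypothetical and the sample rows are made up for illustration; they are not taken from the real data.

```python
import csv
import io
from collections import Counter


def count_column_values(csv_text, column, sep=';'):
    """Count how often each value of `column` appears in delimited text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=sep)
    return Counter(row[column] for row in reader)


# Hypothetical rows in the layout of g7_INSTAGRAM.csv (values are made up).
sample = (
    'hashtag;name;format;height;width;height_to_width;ncol\n'
    'airbus;0.jpg;jpg;512;512;1.0;3\n'
    'airbus;1.jpg;jpg;512;768;0.666667;3\n'
    'boeing;2.jpg;jpg;600;600;1.0;4\n'
)
# The csv module reads every field as text, so the keys are strings.
ncol_counts = count_column_values(sample, 'ncol')
```
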

