In [None]:
import json
from functools import partial
import ast

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import collections as mc
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

When asked to draw a simple shape, do you usually draw it clockwise or counterclockwise? Does it depend on the shape in question? Does it depend on your cultural background? These are questions we rarely ask ourselves, but they can lead to interesting discoveries. Before reading this notebook, try it yourself: draw a circle, a square and a hexagon, notice which direction your hand instinctively goes, and then find out below whether you draw the same way as the majority of people from your region.

To determine how people from different parts of the world draw simple shapes, we will be using the QuickDraw dataset, which is a collection of various simple shapes people drew online when asked to do a sketch of a given object. Here we are most interested in how people draw the most basic shapes (circle, square, hexagon). The dataset includes the country code of the user as an attribute, allowing us to perform some geography / culture-based analysis on our results.

The goal of our analysis is to find out if there are any differences in the direction people draw simple shapes around the world. To do so, we will have to come up with a measure of 'counterclockwiseness' for each of the sketches in the dataset. But first, let us look at the format in which the drawings appear in the dataset. Let us load the hexagon drawings first:

In [None]:
hex_df = pd.read_csv('../input/train_simplified/hexagon.csv')
hex_df['drawing'] = hex_df['drawing'].apply(ast.literal_eval)

In [None]:
hex_df.head()

We see from below that the drawings are stored as lists of continuous strokes, where each stroke is further broken down into line segments and stored as [(x1, x2, x3, ...), (y1, y2, y3, ...)]. We have to remember that image coordinates differ from typical coordinate systems in that the y axis points downwards (so a point near the origin, such as (1, 1), sits at the top left corner rather than the bottom left corner). It is easier to process and visualise the data if we convert it to the ordinary coordinate format by setting y = 255 - y.

In [None]:
hex_df.loc[0, 'drawing']
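To make the stroke layout and the y flip concrete, here is a minimal sketch on a made-up stroke (the coordinates are illustrative and not taken from the dataset). It assumes, as described above, that coordinates lie in the range 0 to 255.

```python
import numpy as np

# Toy stroke in the dataset's layout: [[x1, x2, ...], [y1, y2, ...]]
stroke = [[10, 200, 200], [20, 20, 240]]

# Transpose to get one (x, y) row per point
points = np.array(stroke).T

# Flip y so that it points upwards, as in an ordinary coordinate system
points[:, 1] = 255 - points[:, 1]

print(points.tolist())  # [[10, 235], [200, 235], [200, 15]]
```

The stroke starts near the top left of the image, moves right, then moves down, which after the flip reads as a decrease in y.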

## Hexagons:

Let us first visualise a few examples from the hexagon drawings to see what a typical sketch looks like.

In [None]:
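def to_line_collection(stroke):
    # stroke is [[x1, x2, ...], [y1, y2, ...]]; flip y so that it points upwards
    points = list(zip(stroke[0], [255 - y for y in stroke[1]]))
    # one line segment per pair of consecutive points
    lc = [[points[i], points[i + 1]] for i in range(len(points) - 1)]
    return mc.LineCollection(lc, linewidth=2)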

In [None]:
def visualise_drawing(drawing, ax):
    for stroke in drawing:
        ax.add_collection(to_line_collection(stroke))
    ax.autoscale()
    return ax

In [None]:
f, ax = plt.subplots(2, 5, figsize=(16, 6))
for i in range(10):
    visualise_drawing(hex_df.loc[i, 'drawing'], ax=ax[i//5, i%5])
    ax[i//5, i%5].axis('off')
plt.show()

#### Define clockwise / counterclockwise stroke
Now we need to find a measure for 'counterclockwiseness'. How do we analyse something so abstract? Intuitively, we know that something goes clockwise if it keeps turning right, and counterclockwise if it keeps turning left. Therefore, a good start is to compare each segment in a stroke with the previous one to see whether it turns left or right. To do so, we need to split each stroke into segments and compare the angles between consecutive segments. However, the angle alone would likely not help us much, as we would then only be counting how many 'loops' there are in the curves drawn, regardless of whether a turn is a minor jitter or an entire circle around the drawing area. We need to take into account the relative lengths of the line segments. For each turn, we need to know how much it turned left or right compared to the last line segment. The sine of the turn angle is a good candidate for this, as it can be interpreted as the sideways component of the second segment (measured perpendicular to the first segment) divided by the length of the second segment. It is still not perfect (e.g. a small turn has the same score as a large turn if the length proportions are the same, so a 30° turn and a 150° turn score equally), but it has a few nice properties, such as:

* It can be both positive and negative, and when summed up, equivalent turns in opposite directions cancel each other out
* It is resistant to tiny jitters, so a brief change in stroke direction will not greatly impact the net score
* It is bounded between -1 and 1, so one single turn cannot dominate the score of the whole drawing

To put it simply, **positive score -> counterclockwise preference, negative score -> clockwise preference**.
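As a small illustration of the sign convention, the sketch below computes the sine of the turn angle for one left turn and one right turn, with the y axis pointing upwards. The helper name `turn_sine` is hypothetical and only used for this example.

```python
import numpy as np

def turn_sine(p0, p1, p2):
    """Sine of the signed turn angle at p1 (positive = left turn, y pointing up)."""
    v1 = np.subtract(p1, p0).astype(float)
    v2 = np.subtract(p2, p1).astype(float)
    # z-component of the 2D cross product
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    norm = np.hypot(*v1) * np.hypot(*v2)
    # a zero-length segment has no direction, so it contributes no turn
    return cross / norm if norm > 0 else 0.0

left = turn_sine((0, 0), (1, 0), (1, 1))    # heading east, then north
right = turn_sine((0, 0), (1, 0), (1, -1))  # heading east, then south
print(left, right)  # 1.0 -1.0
```

A right-angle turn to the left scores +1 and the mirrored turn to the right scores -1, so the two cancel when summed.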

Here we calculate the sine score for each turn in the drawings, then sum up the scores by drawings.

In [None]:
def invert_y(strokes):
    strokes[:, 1] = 255 - strokes[:, 1]
    return strokes

def decompose_drawing(drawing):
    strokes = [invert_y(np.array(stroke).T) for stroke in drawing]
    return strokes

In [None]:
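def relative_turn_distance(three_points):
    """Sine of the signed turn angle at the middle point (positive = left turn)."""
    vector_1 = three_points[1, :] - three_points[0, :]
    vector_2 = three_points[2, :] - three_points[1, :]
    # z-component of the 2D cross product, written out explicitly
    # (np.cross on 2-element vectors is deprecated in NumPy 2.0)
    cross = vector_1[0] * vector_2[1] - vector_1[1] * vector_2[0]
    norm = np.linalg.norm(vector_1) * np.linalg.norm(vector_2)
    # a zero-length segment has no direction, so it contributes no turn
    return cross / norm if norm > 0 else 0.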

In [None]:
def stroke_relative_turn_distance(stroke):
    if stroke.shape[0] < 3:
        distance = 0
    else:
        distance = np.sum([relative_turn_distance(stroke[start: start + 3, :]) for start in range(0, stroke.shape[0] - 2)])
    return distance

In [None]:
def drawing_relative_turn_distance(drawing, connect=False):
    strokes = decompose_drawing(drawing)
    if connect:
        distance = stroke_relative_turn_distance(np.concatenate(strokes, axis=0))
    else:
        distance = np.sum([stroke_relative_turn_distance(stroke) for stroke in strokes])
    return distance

When summing the scores for each drawing, we can treat each stroke separately or treat the strokes as if they were chained to each other head to tail (so the change in position from the end of stroke 1 to the start of stroke 2 is also considered a line segment for the purpose of the analysis). We will mainly be doing our analysis using the 'chained' approach, but the 'per stroke' approach produces similar results.
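The difference between the two approaches can be seen on a toy drawing. The sketch below uses a hypothetical vectorised helper, `turn_sines`, rather than the functions defined above, and a made-up drawing of two short strokes along the bottom and top edges of a unit square (y pointing up).

```python
import numpy as np

def turn_sines(points):
    """Sines of the signed turn angles along a polyline (positive = left turn)."""
    pts = np.asarray(points, dtype=float)
    v = np.diff(pts, axis=0)  # one row per segment
    cross = v[:-1, 0] * v[1:, 1] - v[:-1, 1] * v[1:, 0]
    norms = np.linalg.norm(v[:-1], axis=1) * np.linalg.norm(v[1:], axis=1)
    # zero-length segments contribute no turn
    return np.divide(cross, norms, out=np.zeros_like(cross), where=norms > 0)

stroke_a = [(0, 0), (1, 0)]  # bottom edge, drawn left to right
stroke_b = [(1, 1), (0, 1)]  # top edge, drawn right to left

# Per stroke: each stroke has a single segment, so there are no turns to score
per_stroke = sum(turn_sines(s).sum() for s in (stroke_a, stroke_b))

# Chained: the pen jump from (1, 0) to (1, 1) counts as a segment
chained = turn_sines(stroke_a + stroke_b).sum()

print(per_stroke, chained)  # 0.0 2.0
```

In this example only the chained score picks up the counterclockwise order in which the two edges were drawn.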

In [None]:
# per_stroke_score = hex_df['drawing'].apply(partial(drawing_relative_turn_distance, connect=False))
per_drawing_score = hex_df['drawing'].apply(partial(drawing_relative_turn_distance, connect=True))

In [None]:
# hex_df['sum_per_stroke_score'] = per_stroke_score
hex_df['score'] = per_drawing_score

In [None]:
hex_df.head()

Let us first look at the most 'unusual' drawings based on the score, from both ends of the scale:

In [None]:
f, ax = plt.subplots(2, 5, figsize=(16, 6))
ordered_subset = hex_df.sort_values('score').iloc[:5, :]
for i, drawing in enumerate(ordered_subset['drawing']):
    visualise_drawing(drawing, ax=ax[0, i])
    ax[0, i].axis('off')
ordered_subset = hex_df.sort_values('score').iloc[-5:, :]
for i, drawing in enumerate(ordered_subset['drawing']):
    visualise_drawing(drawing, ax=ax[1, i])
    ax[1, i].axis('off')
plt.show()

As we see, the most extreme cases are typically random drawings that do not make much sense, so it is likely safe to exclude the most extreme cases in our analysis later. As we see below, most of the score values fall between -5 and 5.

In [None]:
hex_df.describe()
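The exclusion of extreme scores mentioned above can be sketched on a toy frame. The scores are made up, and the cutoff of 10 is the one applied in the plots later in this notebook.

```python
import pandas as pd

# Toy scores: two extreme values and three typical ones
toy = pd.DataFrame({'score': [-12.0, -3.5, 0.2, 4.1, 25.0]})

# Keep only drawings whose absolute score is below the cutoff
cutoff = 10
kept = toy[toy['score'].abs() < cutoff]

print(kept['score'].tolist())  # [-3.5, 0.2, 4.1]
```

Taking the absolute value before comparing trims both tails of the distribution, not just the upper one.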

Now let us look at the drawings that are closest to the median score of the drawings. As we see below, having a median score for the 'counterclockwiseness' does not necessarily mean having a drawing closest to a standard shape.

In [None]:
f, ax = plt.subplots(2, 5, figsize=(16, 6))
temp = hex_df.copy()
temp['dev'] = np.abs(temp['score'] - temp['score'].median())
ordered_subset = temp.sort_values('dev', ascending=True).iloc[:10, :]
for i, drawing in enumerate(ordered_subset['drawing']):
    visualise_drawing(drawing, ax=ax[i//5, i%5])
    ax[i//5, i%5].axis('off')
plt.show()

Let us get back to our original question. Can we expect to observe different patterns from different parts of the world in terms of drawing counterclockwiseness?

In [None]:
country_codes = '''
Country Name;ISO 3166-1-alpha-2 code
AFGHANISTAN;AF
ÅLAND ISLANDS;AX
ALBANIA;AL
ALGERIA;DZ
AMERICAN SAMOA;AS
ANDORRA;AD
ANGOLA;AO
ANGUILLA;AI
ANTARCTICA;AQ
ANTIGUA AND BARBUDA;AG
ARGENTINA;AR
ARMENIA;AM
ARUBA;AW
AUSTRALIA;AU
AUSTRIA;AT
AZERBAIJAN;AZ
BAHAMAS;BS
BAHRAIN;BH
BANGLADESH;BD
BARBADOS;BB
BELARUS;BY
BELGIUM;BE
BELIZE;BZ
BENIN;BJ
BERMUDA;BM
BHUTAN;BT
BOLIVIA, PLURINATIONAL STATE OF;BO
BONAIRE, SINT EUSTATIUS AND SABA;BQ
BOSNIA AND HERZEGOVINA;BA
BOTSWANA;BW
BOUVET ISLAND;BV
BRAZIL;BR
BRITISH INDIAN OCEAN TERRITORY;IO
BRUNEI DARUSSALAM;BN
BULGARIA;BG
BURKINA FASO;BF
BURUNDI;BI
CAMBODIA;KH
CAMEROON;CM
CANADA;CA
CAPE VERDE;CV
CAYMAN ISLANDS;KY
CENTRAL AFRICAN REPUBLIC;CF
CHAD;TD
CHILE;CL
CHINA;CN
CHRISTMAS ISLAND;CX
COCOS (KEELING) ISLANDS;CC
COLOMBIA;CO
COMOROS;KM
CONGO;CG
CONGO, THE DEMOCRATIC REPUBLIC OF THE;CD
COOK ISLANDS;CK
COSTA RICA;CR
CÔTE D'IVOIRE;CI
CROATIA;HR
CUBA;CU
CURAÇAO;CW
CYPRUS;CY
CZECH REPUBLIC;CZ
DENMARK;DK
DJIBOUTI;DJ
DOMINICA;DM
DOMINICAN REPUBLIC;DO
ECUADOR;EC
EGYPT;EG
EL SALVADOR;SV
EQUATORIAL GUINEA;GQ
ERITREA;ER
ESTONIA;EE
ETHIOPIA;ET
FALKLAND ISLANDS (MALVINAS);FK
FAROE ISLANDS;FO
FIJI;FJ
FINLAND;FI
FRANCE;FR
FRENCH GUIANA;GF
FRENCH POLYNESIA;PF
FRENCH SOUTHERN TERRITORIES;TF
GABON;GA
GAMBIA;GM
GEORGIA;GE
GERMANY;DE
GHANA;GH
GIBRALTAR;GI
GREECE;GR
GREENLAND;GL
GRENADA;GD
GUADELOUPE;GP
GUAM;GU
GUATEMALA;GT
GUERNSEY;GG
GUINEA;GN
GUINEA-BISSAU;GW
GUYANA;GY
HAITI;HT
HEARD ISLAND AND MCDONALD ISLANDS;HM
HOLY SEE (VATICAN CITY STATE);VA
HONDURAS;HN
HONG KONG;HK
HUNGARY;HU
ICELAND;IS
INDIA;IN
INDONESIA;ID
IRAN, ISLAMIC REPUBLIC OF;IR
IRAQ;IQ
IRELAND;IE
ISLE OF MAN;IM
ISRAEL;IL
ITALY;IT
JAMAICA;JM
JAPAN;JP
JERSEY;JE
JORDAN;JO
KAZAKHSTAN;KZ
KENYA;KE
KIRIBATI;KI
KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF;KP
KOREA, REPUBLIC OF;KR
KUWAIT;KW
KYRGYZSTAN;KG
LAO PEOPLE'S DEMOCRATIC REPUBLIC;LA
LATVIA;LV
LEBANON;LB
LESOTHO;LS
LIBERIA;LR
LIBYA;LY
LIECHTENSTEIN;LI
LITHUANIA;LT
LUXEMBOURG;LU
MACAO;MO
MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF;MK
MADAGASCAR;MG
MALAWI;MW
MALAYSIA;MY
MALDIVES;MV
MALI;ML
MALTA;MT
MARSHALL ISLANDS;MH
MARTINIQUE;MQ
MAURITANIA;MR
MAURITIUS;MU
MAYOTTE;YT
MEXICO;MX
MICRONESIA, FEDERATED STATES OF;FM
MOLDOVA, REPUBLIC OF;MD
MONACO;MC
MONGOLIA;MN
MONTENEGRO;ME
MONTSERRAT;MS
MOROCCO;MA
MOZAMBIQUE;MZ
MYANMAR;MM
NAMIBIA;NA
NAURU;NR
NEPAL;NP
NETHERLANDS;NL
NEW CALEDONIA;NC
NEW ZEALAND;NZ
NICARAGUA;NI
NIGER;NE
NIGERIA;NG
NIUE;NU
NORFOLK ISLAND;NF
NORTHERN MARIANA ISLANDS;MP
NORWAY;NO
OMAN;OM
PAKISTAN;PK
PALAU;PW
PALESTINE, STATE OF;PS
PANAMA;PA
PAPUA NEW GUINEA;PG
PARAGUAY;PY
PERU;PE
PHILIPPINES;PH
PITCAIRN;PN
POLAND;PL
PORTUGAL;PT
PUERTO RICO;PR
QATAR;QA
RÉUNION;RE
ROMANIA;RO
RUSSIAN FEDERATION;RU
RWANDA;RW
SAINT BARTHÉLEMY;BL
SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA;SH
SAINT KITTS AND NEVIS;KN
SAINT LUCIA;LC
SAINT MARTIN (FRENCH PART);MF
SAINT PIERRE AND MIQUELON;PM
SAINT VINCENT AND THE GRENADINES;VC
SAMOA;WS
SAN MARINO;SM
SAO TOME AND PRINCIPE;ST
SAUDI ARABIA;SA
SENEGAL;SN
SERBIA;RS
SEYCHELLES;SC
SIERRA LEONE;SL
SINGAPORE;SG
SINT MAARTEN (DUTCH PART);SX
SLOVAKIA;SK
SLOVENIA;SI
SOLOMON ISLANDS;SB
SOMALIA;SO
SOUTH AFRICA;ZA
SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS;GS
SOUTH SUDAN;SS
SPAIN;ES
SRI LANKA;LK
SUDAN;SD
SURINAME;SR
SVALBARD AND JAN MAYEN;SJ
SWAZILAND;SZ
SWEDEN;SE
SWITZERLAND;CH
SYRIAN ARAB REPUBLIC;SY
TAIWAN;TW
TAJIKISTAN;TJ
TANZANIA, UNITED REPUBLIC OF;TZ
THAILAND;TH
TIMOR-LESTE;TL
TOGO;TG
TOKELAU;TK
TONGA;TO
TRINIDAD AND TOBAGO;TT
TUNISIA;TN
TURKEY;TR
TURKMENISTAN;TM
TURKS AND CAICOS ISLANDS;TC
TUVALU;TV
UGANDA;UG
UKRAINE;UA
UNITED ARAB EMIRATES;AE
UNITED KINGDOM;GB
UNITED STATES;US
UNITED STATES MINOR OUTLYING ISLANDS;UM
URUGUAY;UY
UZBEKISTAN;UZ
VANUATU;VU
VENEZUELA, BOLIVARIAN REPUBLIC OF;VE
VIET NAM;VN
VIRGIN ISLANDS, BRITISH;VG
VIRGIN ISLANDS, U.S.;VI
WALLIS AND FUTUNA;WF
WESTERN SAHARA;EH
YEMEN;YE
ZAMBIA;ZM
ZIMBABWE;ZW
'''
from io import StringIO
country_codes_df = pd.read_csv(StringIO(country_codes), sep=';')
country_codes_df.columns = ['countryname', 'countrycode']

In [None]:
country_codes_df.head()

In [None]:
hex_df = pd.merge(hex_df, country_codes_df, on='countrycode', how='left')
top_countries = hex_df['countrycode'].value_counts()

In [None]:
f, ax = plt.subplots(figsize=(14, 14))
subset = hex_df[hex_df['countrycode'].isin(top_countries.index[:40]) & (np.abs(hex_df['score']) < 10)]
sub_order = subset.groupby('countryname')['score'].mean().sort_values().index
sns.barplot(data=subset, y='countryname', x='score', order=sub_order)
plt.show()

What a surprise! While we see that most users around the world draw a hexagon clockwise (negative score), the average user from Japan actually draws the shape counterclockwise. Also, two nearby countries, Thailand and Viet Nam, share the top two spots on the clockwiseness scale. We do have to take into account sample size though, as these two countries' scores have a rather high variance.
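One way to keep sample size in view is to tabulate the number of drawings and the standard error of the mean next to each country's average score. This is a minimal sketch on made-up data; the country names 'A' and 'B' and their scores are placeholders, not results from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'countryname': ['A', 'A', 'A', 'A', 'B', 'B'],
    'score': [-1.0, -2.0, -1.5, -1.5, -3.0, 1.0],
})

# count, mean and standard error of the mean for each country
summary = toy.groupby('countryname')['score'].agg(['count', 'mean', 'sem'])
print(summary)
```

A country with few drawings and widely spread scores (like 'B' here) gets a large standard error, which signals that its position in the ranking is less reliable.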

## Circles:

In [None]:
circle_df = pd.read_csv('../input/train_simplified/circle.csv')
circle_df['drawing'] = circle_df['drawing'].apply(ast.literal_eval)

In [None]:
f, ax = plt.subplots(2, 5, figsize=(16, 6))
for i in range(10):
    visualise_drawing(circle_df.loc[i, 'drawing'], ax=ax[i//5, i%5])
    ax[i//5, i%5].axis('off')
plt.show()

In [None]:
circle_df['score'] = circle_df['drawing'].apply(partial(drawing_relative_turn_distance, connect=True))
circle_df = pd.merge(circle_df, country_codes_df, on='countrycode', how='left')
top_countries = circle_df['countrycode'].value_counts()

In [None]:
f, ax = plt.subplots(figsize=(14, 14))
subset = circle_df[circle_df['countrycode'].isin(top_countries.index[:40]) & (np.abs(circle_df['score']) < 10)]
sub_order = subset.groupby('countryname')['score'].mean().sort_values().index
sns.barplot(data=subset, y='countryname', x='score', order=sub_order)
plt.show()

Another surprise! This time, most users around the world tend to draw circles counterclockwise, except in Taiwan and (again!) Japan, whose users typically draw them clockwise. We notice that even for simple shapes like hexagons and circles, the type of shape still has an impact on the direction in which people usually draw it. What about other polygons? Are they more like hexagons or more like circles? Which of the two shapes above is the norm and which is the exception, or is it just different for each individual shape?

We also learned that at least for the two tasks above, the Japanese users always do things differently. I wonder why.

## Squares:

In [None]:
square_df = pd.read_csv('../input/train_simplified/square.csv')
square_df['drawing'] = square_df['drawing'].apply(ast.literal_eval)

f, ax = plt.subplots(2, 5, figsize=(16, 6))
for i in range(10):
    visualise_drawing(square_df.loc[i, 'drawing'], ax=ax[i//5, i%5])
    ax[i//5, i%5].axis('off')
plt.show()

In [None]:
square_df['score'] = square_df['drawing'].apply(partial(drawing_relative_turn_distance, connect=True))
square_df = pd.merge(square_df, country_codes_df, on='countrycode', how='left')
top_countries = square_df['countrycode'].value_counts()

In [None]:
f, ax = plt.subplots(figsize=(14, 14))
subset = square_df[square_df['countrycode'].isin(top_countries.index[:40]) & (np.abs(square_df['score']) < 10)]
sub_order = subset.groupby('countryname')['score'].mean().sort_values().index
sns.barplot(data=subset, y='countryname', x='score', order=sub_order)
plt.show()

Here we see the results for squares. Japan is not the odd one out this time, but we see an interesting pattern. While most users draw a square counterclockwise, people from European countries are more likely to do so, whereas users from Asian countries (including India and the Middle East) seem less likely to. Maybe people from different parts of the world just have specific preferences for each of the common shapes. What about a less common shape like an octagon?

## Octagons:

In [None]:
octagon_df = pd.read_csv('../input/train_simplified/octagon.csv')
octagon_df['drawing'] = octagon_df['drawing'].apply(ast.literal_eval)

f, ax = plt.subplots(2, 5, figsize=(16, 6))
for i in range(10):
    visualise_drawing(octagon_df.loc[i, 'drawing'], ax=ax[i//5, i%5])
    ax[i//5, i%5].axis('off')
plt.show()

Look at the samples above. Looks like quite a few users have no idea what an octagon looks like!

In [None]:
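# Note: connect=False scores each stroke separately, unlike the chained scoring (connect=True) used for the other shapes
octagon_df['score'] = octagon_df['drawing'].apply(partial(drawing_relative_turn_distance, connect=False))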
octagon_df = pd.merge(octagon_df, country_codes_df, on='countrycode', how='left')
top_countries = octagon_df['countrycode'].value_counts()

In [None]:
f, ax = plt.subplots(figsize=(14, 14))
subset = octagon_df[octagon_df['countrycode'].isin(top_countries.index[:40]) & (np.abs(octagon_df['score']) < 10)]
sub_order = subset.groupby('countryname')['score'].mean().sort_values().index
sns.barplot(data=subset, y='countryname', x='score', order=sub_order)
plt.show()

We see a much more random pattern here. It looks like strong preferences only exist for simpler shapes that are drawn often. When it comes to less common shapes, the direction of drawing is less determined by the user's location and the variance in personal preferences becomes larger.

## Finally, let us compare the four shapes:

In [None]:
dfs = [square_df, hex_df, octagon_df, circle_df]
combined = pd.concat(dfs, axis=0, sort=False)
# n_segs counts the points across all strokes, used here as a proxy for the number of segments
combined['n_segs'] = combined['drawing'].apply(lambda x: np.sum([len(s[0]) for s in x]))
combined['per_seg_score'] = combined['score'] / combined['n_segs']
combined.groupby('word')['n_segs'].describe()

In [None]:
combined.groupby('word')['score'].describe()

In [None]:
f, ax = plt.subplots(figsize=(14, 10))
for word in combined['word'].unique():
    sns.kdeplot(data=combined[(combined['word'] == word) & (np.abs(combined['score']) < 10)]['score'], ax=ax, label=word)
ax.set_xlabel('counterclockwiseness score')
plt.show()

We see the distributions of counterclockwiseness for each of the shapes above. For almost all of them, there is a divide between clockwise and counterclockwise users, as seen in the bimodal shape of the distributions. For octagons, however, we actually see a *trimodal* pattern, with a significant number of drawings showing no particular direction preference. Do people actually draw from both sides and meet in the middle?
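The modes described above could be quantified by binning scores into three groups. The sketch below uses made-up scores and a hypothetical tolerance `tol` for what counts as 'no particular direction'; neither comes from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up scores standing in for one shape's 'score' column
scores = pd.Series([-3.2, -2.9, -0.1, 0.05, 2.8, 3.1, 3.3])

tol = 0.5  # hypothetical threshold for an 'undirected' drawing
labels = pd.cut(
    scores,
    bins=[-np.inf, -tol, tol, np.inf],
    labels=['clockwise', 'undirected', 'counterclockwise'],
)

# Share of drawings falling into each group
shares = labels.value_counts(normalize=True)
print(shares)
```

Applied per shape, a large 'undirected' share would correspond to the middle mode seen for octagons.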

The distributions of the different shapes are slightly shifted away from each other due to the different number of segments in each shape. If we adjust for the number of segments, we see the distribution below:

In [None]:
f, ax = plt.subplots(figsize=(14, 10))
for word in combined['word'].unique():
    sns.kdeplot(data=combined[(combined['word'] == word) & (np.abs(combined['per_seg_score']) < 2)]['per_seg_score'], ax=ax, label=word)
ax.set_xlabel('counterclockwiseness score per segment')
plt.show()

Here we can see the bimodal / trimodal patterns more clearly. Since most of the shapes drawn complete a full turn from start to finish, we see, as expected, similar (absolute) values of the most typical score per segment for all of the shapes.
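The shift between shapes in the unadjusted scores can be illustrated with ideal regular polygons, for which a closed counterclockwise outline with n corners sums to n·sin(2π/n). This is a sketch under idealised assumptions (perfectly regular outlines, y pointing up); the helper names `total_turn_sine` and `regular_polygon` are hypothetical, and hand-drawn sketches will deviate from these values.

```python
import numpy as np

def total_turn_sine(points):
    """Sum of the sines of the signed turn angles along a polyline."""
    v = np.diff(np.asarray(points, dtype=float), axis=0)
    cross = v[:-1, 0] * v[1:, 1] - v[:-1, 1] * v[1:, 0]
    norms = np.linalg.norm(v[:-1], axis=1) * np.linalg.norm(v[1:], axis=1)
    return float(np.sum(cross / norms))

def regular_polygon(n):
    """Counterclockwise regular n-gon, with n + 2 vertices so that the
    closed outline contributes exactly n turns."""
    angles = 2 * np.pi * np.arange(n + 2) / n
    return np.column_stack([np.cos(angles), np.sin(angles)])

# 100 sides stands in for a circle
ideal = {n: total_turn_sine(regular_polygon(n)) for n in (4, 6, 8, 100)}
for n, score in ideal.items():
    print(n, round(score, 3), round(n * np.sin(2 * np.pi / n), 3))
```

The total grows with the number of corners and approaches 2π for a circle, which is consistent with the score distributions of different shapes peaking at slightly different values.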