# Converting PDF results of APMO 2015 to CSV

In the past, the results for APMO were reported in `PDF` format. We want to convert these to a more friendly `CSV` format for display and analysis.

In the case of APMO 2015, after copying and pasting the ranked table from the PDF to the text file in `results_pre2016/apmo2015_res_text.txt` we get the following:

```
1
2
...
32
33

USA
KOREA
...
PANAMA
CAMBODIA

...
```

In other words, the columns get stacked one after the other. We will exploit this structure to extract the columns, turn them into a `pandas` data frame for manipulation, and then save them to `CSV`.

In [1]:
import pandas as pd

# We use a year variable for reusability.
year=2015

We open the file and split it by lines.

In [2]:
with open('results_pre2016/apmo%s_res_text.txt' % year, 'r') as infile:
    allinfo = infile.read().splitlines()

print(allinfo)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '17', '19', '20', '21', '22', '23', '24', '24', '26', '27', '28', '29', '30', '31', '32', '33', '', '', 'USA', 'KOREA', 'RUSSIA', 'SINGAPORE', 'JAPAN', 'CANADA', 'THAILAND', 'TAIWAN', 'AUSTRALIA', 'BRAZIL', 'PERU', 'MEXICO', 'HONG KONG', 'KAZAKHSTAN', 'INDONESIA', 'MALAYSIA', 'INDIA', 'TAJIKISTAN', 'BANGLADESH', 'PHILIPPINES', 'TURKMENISTAN', 'SAUDI ARABIA', 'NEW ZEALAND', 'ARGENTINA', 'COLOMBIA', 'SYRIA', 'SRI LANKA', 'EL SALVADOR', 'TRINIDAD AND TOBAGO', 'ECUADOR', 'COSTA RICA', 'PANAMA', 'CAMBODIA', 'Total', '', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '6', '7', '7', '10', '10', '5', '2', '2', '299', '', '298', '279', '266', '259', '256', '237', '228', '222', '205', '202', '185', '169', '167', '163', '161', '134', '127', '127', '122', '105', '99', '94', '86', '73', '73', '52', '48', 

There are 33 participating countries, followed by a row for totals and an empty row that separates each stacked column from the next. Each column therefore occupies 35 lines, so column `j` starts at index `35*j` and its 33 country entries are `allinfo[35*j:35*j+33]`. We extract the eight columns this way and print them as a sanity check that we are not losing information.

In [3]:
columns=[allinfo[35*j:35*j+33] for j in range(8)]
for col in columns:
    print(col)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '17', '19', '20', '21', '22', '23', '24', '24', '26', '27', '28', '29', '30', '31', '32', '33']
['USA', 'KOREA', 'RUSSIA', 'SINGAPORE', 'JAPAN', 'CANADA', 'THAILAND', 'TAIWAN', 'AUSTRALIA', 'BRAZIL', 'PERU', 'MEXICO', 'HONG KONG', 'KAZAKHSTAN', 'INDONESIA', 'MALAYSIA', 'INDIA', 'TAJIKISTAN', 'BANGLADESH', 'PHILIPPINES', 'TURKMENISTAN', 'SAUDI ARABIA', 'NEW ZEALAND', 'ARGENTINA', 'COLOMBIA', 'SYRIA', 'SRI LANKA', 'EL SALVADOR', 'TRINIDAD AND TOBAGO', 'ECUADOR', 'COSTA RICA', 'PANAMA', 'CAMBODIA']
['10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '6', '7', '7', '10', '10', '5', '2', '2']
['298', '279', '266', '259', '256', '237', '228', '222', '205', '202', '185', '169', '167', '163', '161', '134', '127', '127', '122', '105', '99', '94', '86', '73', '73', '52', '48', '47', '31', '27', '20', '12',
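
The stride of 35 and the column count of 8 are specific to this file. As a rough sketch, the same slicing could be wrapped in a small helper that also checks that every column has the expected length. The name `split_stacked_columns` and its parameters are illustrative assumptions, not part of the original notebook, and the input is toy data.

```python
def split_stacked_columns(lines, n_rows, stride, n_cols):
    """Slice a flat list of lines into n_cols columns of n_rows entries each.

    Hypothetical helper: column j is assumed to start at index stride * j.
    """
    columns = [lines[stride * j: stride * j + n_rows] for j in range(n_cols)]
    for j, col in enumerate(columns):
        # A short column means the file has fewer lines than expected.
        if len(col) != n_rows:
            raise ValueError(
                f"column {j} has {len(col)} entries, expected {n_rows}"
            )
    return columns


# Toy input: two columns of three rows, each followed by two filler lines.
toy_lines = ['1', '2', '3', '', '', 'USA', 'KOREA', 'RUSSIA', 'Total', '']
toy_columns = split_stacked_columns(toy_lines, n_rows=3, stride=5, n_cols=2)
print(toy_columns)
```

For the 2015 file, the equivalent call would presumably be `split_stacked_columns(allinfo, n_rows=33, stride=35, n_cols=8)`.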

Now we are ready to transform this information into a `pandas` data frame.
From the output above, we also note that three country names need to be changed to match the standard country names that we are using.

In [4]:
data = pd.DataFrame(list(zip(*columns)), columns=['Rank', 'Country', '# of Contestants', 'Total Score', 'Gold Awards', 'Silver Awards', 'Bronze Awards', 'Honorable Mentions'])
data.Country=data.Country.str.title()
data.loc[0,'Country']='United States of America'
data.loc[1,'Country']='Republic of Korea'
data.loc[28,'Country']='Trinidad and Tobago'
data

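| | Rank | Country | # of Contestants | Total Score | Gold Awards | Silver Awards | Bronze Awards | Honorable Mentions |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | United States of America | 10 | 298 | 1 | 2 | 4 | 3 |
| 1 | 2 | Republic of Korea | 10 | 279 | 1 | 2 | 4 | 3 |
| 2 | 3 | Russia | 10 | 266 | 1 | 2 | 4 | 3 |
| 3 | 4 | Singapore | 10 | 259 | 1 | 2 | 4 | 3 |
| 4 | 5 | Japan | 10 | 256 | 1 | 2 | 4 | 3 |
| 5 | 6 | Canada | 10 | 237 | 1 | 2 | 4 | 3 |
| 6 | 7 | Thailand | 10 | 228 | 1 | 2 | 4 | 3 |
| 7 | 8 | Taiwan | 10 | 222 | 1 | 2 | 4 | 3 |
| 8 | 9 | Australia | 10 | 205 | 1 | 2 | 4 | 3 |
| 9 | 10 | Brazil | 10 | 202 | 1 | 2 | 4 | 3 |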
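
The cell above fixes the three names by row position, which depends on the ranking order of this particular year. A possible alternative, sketched here on toy data, is to map the title-cased names to the standard ones by value. The `name_fixes` dictionary is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Toy frame with the upper-case names as they appear in the PDF.
toy = pd.DataFrame({'Country': ['USA', 'KOREA', 'TRINIDAD AND TOBAGO', 'JAPAN']})

# Hypothetical mapping from title-cased names to the standard names used above.
name_fixes = {
    'Usa': 'United States of America',
    'Korea': 'Republic of Korea',
    'Trinidad And Tobago': 'Trinidad and Tobago',
}

# Title-case every name, then replace only the values listed in the mapping.
toy['Country'] = toy['Country'].str.title().replace(name_fixes)
print(toy['Country'].tolist())
```

Because the replacement is keyed on the name rather than on the row index, the same mapping could be reused for other years in which these countries appear at different ranks.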


Now we add the three-letter ISO code that we use for navigation on the website. We load the codes from the `iso-alpha-3.csv` file.

The merge does not preserve the original row order, which is undesirable, so we sort the rows by rank again. Since the ranks were read as strings, we first convert the rank column to `int` so that the sort is numeric rather than alphabetical.

In [5]:
codes=pd.read_csv('iso-alpha-3.csv')
data_coded=pd.merge(codes,data,left_on='country', right_on='Country', how='right').drop('country', axis=1)
data_coded['Rank']=data_coded.Rank.astype(int)
data_coded=data_coded.sort_values('Rank')
cols=data_coded.columns.tolist()
data_coded=data_coded[[cols[1],cols[0]]+cols[2:]]
data_coded.rename(columns={'code':'Code'}, inplace=True)
data_coded

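| | Rank | Code | Country | # of Contestants | Total Score | Gold Awards | Silver Awards | Bronze Awards | Honorable Mentions |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 1 | USA | United States of America | 10 | 298 | 1 | 2 | 4 | 3 |
| 15 | 2 | KOR | Republic of Korea | 10 | 279 | 1 | 2 | 4 | 3 |
| 22 | 3 | RUS | Russia | 10 | 266 | 1 | 2 | 4 | 3 |
| 24 | 4 | SGP | Singapore | 10 | 259 | 1 | 2 | 4 | 3 |
| 13 | 5 | JPN | Japan | 10 | 256 | 1 | 2 | 4 | 3 |
| 5 | 6 | CAN | Canada | 10 | 237 | 1 | 2 | 4 | 3 |
| 29 | 7 | THA | Thailand | 10 | 228 | 1 | 2 | 4 | 3 |
| 27 | 8 | TWN | Taiwan | 10 | 222 | 1 | 2 | 4 | 3 |
| 1 | 9 | AUS | Australia | 10 | 205 | 1 | 2 | 4 | 3 |
| 3 | 10 | BRA | Brazil | 10 | 202 | 1 | 2 | 4 | 3 |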
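
Because the merge uses `how='right'`, a country whose name does not appear in the code list is kept but receives a missing code. A minimal check for such rows might look like the sketch below. The data frames are toy stand-ins, and only the column names `country`, `code` and `Country` are taken from the cell above.

```python
import pandas as pd

# Toy stand-ins for the ISO code table and the results table.
codes = pd.DataFrame({'country': ['Japan', 'Canada'], 'code': ['JPN', 'CAN']})
data = pd.DataFrame({'Rank': ['1', '2', '3'],
                     'Country': ['Canada', 'Japan', 'Atlantis']})

merged = pd.merge(codes, data, left_on='country', right_on='Country', how='right')

# Rows from `data` with no match in `codes` end up with a missing code.
unmatched = merged.loc[merged['code'].isna(), 'Country'].tolist()
print(unmatched)
```

An empty list would indicate that every country name matched an entry in the code table; any name listed would need to be standardised before the merge.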


Now the information is exactly in the form that we need. We save the work.

In [6]:
data_coded.to_csv('reports/by_country_ranked_%s.csv' % year,index=False)
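
Note that `to_csv` does not create missing directories, so the `reports` folder has to exist before the cell above runs. The following sketch, which assumes a temporary directory in place of the real project folder and uses toy data, creates the folder if needed and reads the file back as a quick round-trip check.

```python
import tempfile
from pathlib import Path

import pandas as pd

year = 2015
toy = pd.DataFrame({'Rank': [1, 2], 'Code': ['USA', 'KOR']})

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'reports'
    # Create the output folder if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f'by_country_ranked_{year}.csv'
    toy.to_csv(out_path, index=False)
    # Read the file back to confirm that the rows and columns survived.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```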