# Converting PDF results of APMO 2014 to CSV

In the past, the results for APMO were reported in `PDF` format. We want to convert these to a more friendly `CSV` format for display and analysis.

For APMO 2014, copying the ranked table from the PDF and pasting it into the text file `results_pre2016/apmo2014_res_text.txt` gives the following:

```
1
2
...
35
36

KOREA
USA
...
URUGUAY
ECUADOR

...
```

In other words, the columns get stacked one after the other. We will exploit this structure to extract the columns, turn them into a `pandas` data frame for manipulation, and then save them to `CSV`.

In [8]:
import pandas as pd

# We use a year variable for reusability.
year=2014

We open the file and split it by lines.

In [9]:
# The handle is a file object, not a file name, so we call it text_file.
with open('results_pre2016/apmo%s_res_text.txt' % year, 'r') as text_file:
    allinfo = text_file.read().splitlines()

print(allinfo)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '22', '24', '25', '25', '27', '27', '29', '30', '31', '32', '33', '34', '35', '36', '', '', 'KOREA', 'USA', 'RUSSIA', 'BRAZIL', 'THAILAND', 'JAPAN', 'CANADA', 'TAIWAN', 'AUSTRALIA', 'MEXICO', 'HONG KONG', 'SINGAPORE', 'INDONESIA', 'ARGENTINA', 'KAZAKHSTAN', 'PERU', 'MALAYSIA', 'PHILIPPINES', 'BANGLADESH', 'NEW ZEALAND', 'TAJIKISTAN', 'SYRIA', 'TURKMENISTAN', 'SAUDI ARABIA', 'PAKISTAN', 'SRI LANKA', 'CAMBODIA', 'COLOMBIA', 'AZERBAIJAN', 'PANAMA', 'EL SALVADOR', 'COSTA RICA', 'TRINIDAD AND TOBAGO', 'KYRGYZ REPUBLIC', 'URUGUAY', 'ECUADOR', 'Total', '', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '7', '10', '8', '7', '10', '10', '10', '5', '10', '3', '10', '1', '7', '6', '5', '3', '6', '10', '308', '', '244', '231', '218', '179', '175', '174', '167', '155', '149', '144', '141', '129', '126', '115', '111',

There are 36 participating countries. Each stacked column is followed by two extra lines: a totals line (empty for the rank column) and a blank separator line. Each column therefore occupies 38 lines, so column `j` starts at index `38*j`. We slice out the first 36 entries of each of the 8 columns and print them as a sanity check that we are not losing information.

In [10]:
columns=[allinfo[38*j:38*j+36] for j in range(8)]
for col in columns:
    print(col)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '22', '24', '25', '25', '27', '27', '29', '30', '31', '32', '33', '34', '35', '36']
['KOREA', 'USA', 'RUSSIA', 'BRAZIL', 'THAILAND', 'JAPAN', 'CANADA', 'TAIWAN', 'AUSTRALIA', 'MEXICO', 'HONG KONG', 'SINGAPORE', 'INDONESIA', 'ARGENTINA', 'KAZAKHSTAN', 'PERU', 'MALAYSIA', 'PHILIPPINES', 'BANGLADESH', 'NEW ZEALAND', 'TAJIKISTAN', 'SYRIA', 'TURKMENISTAN', 'SAUDI ARABIA', 'PAKISTAN', 'SRI LANKA', 'CAMBODIA', 'COLOMBIA', 'AZERBAIJAN', 'PANAMA', 'EL SALVADOR', 'COSTA RICA', 'TRINIDAD AND TOBAGO', 'KYRGYZ REPUBLIC', 'URUGUAY', 'ECUADOR']
['10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '10', '7', '10', '8', '7', '10', '10', '10', '5', '10', '3', '10', '1', '7', '6', '5', '3', '6', '10']
['244', '231', '218', '179', '175', '174', '167', '155', '149', '144', '141', '129', '126', '115', '111', '109', '94', '79', '50', '44
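The slicing step can be wrapped in a small helper that also fails loudly when the file is shorter than expected. This is only a sketch: the function name `split_stacked_columns` and its parameters are hypothetical, and the toy input stands in for the real file.

```python
def split_stacked_columns(lines, n_rows, n_cols, stride):
    """Slice a flat list of lines into n_cols columns of n_rows entries each.

    Column j is assumed to start at index stride * j.
    """
    needed = stride * (n_cols - 1) + n_rows
    if len(lines) < needed:
        raise ValueError(f"expected at least {needed} lines, got {len(lines)}")
    return [lines[stride * j: stride * j + n_rows] for j in range(n_cols)]


# Toy input: 3 rows per column, 2 trailing lines after each column (stride 5).
toy = ['1', '2', '3', '', '',
       'KOREA', 'USA', 'RUSSIA', 'Total', '',
       '10', '10', '10', '30', '']
toy_columns = split_stacked_columns(toy, n_rows=3, n_cols=3, stride=5)
print(toy_columns)
```

For the 2014 file the corresponding call would use `n_rows=36`, `n_cols=8` and `stride=38`.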

Now we are ready to transform this information into a `pandas` data frame.
The output above also shows that some country names need to be changed to match the standard country names that we are using.

In [12]:
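# Transpose the list of columns into rows and label each field.
data = pd.DataFrame(
    list(zip(*columns)),
    columns=['Rank', 'Country', '# of Contestants', 'Total Score',
             'Gold Awards', 'Silver Awards', 'Bronze Awards', 'Honorable Mentions'])
# The PDF uses upper case; title case matches most of the standard names.
data.Country=data.Country.str.title()
# Fix the names that title-casing alone does not standardize.
data.loc[0,'Country']='Republic of Korea'
data.loc[1,'Country']='United States of America'
data.loc[32,'Country']='Trinidad and Tobago'
data.loc[33,'Country']='Kyrgyzstan'
data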

Unnamed: 0,Rank,Country,# of Contestants,Total Score,Gold Awards,Silver Awards,Bronze Awards,Honorable Mentions
0,1,Republic of Korea,10,244,1,2,4,3
1,2,United States of America,10,231,1,2,4,3
2,3,Russia,10,218,1,2,4,3
3,4,Brazil,10,179,1,2,4,3
4,5,Thailand,10,175,1,2,4,3
5,6,Japan,10,174,1,2,4,3
6,7,Canada,10,167,1,2,4,3
7,8,Taiwan,10,155,1,2,4,3
8,9,Australia,10,149,1,2,4,3
9,10,Mexico,10,144,1,2,4,3
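The renaming above addresses rows by position, which only works while the ranking order stays the same. As a possible alternative, the fixes can be keyed on the names themselves. This is a minimal sketch: `NAME_FIXES` and `standardize_countries` are hypothetical names, and the mapping covers only the four corrections made above.

```python
import pandas as pd

# Hypothetical mapping from title-cased PDF names to the standard names.
NAME_FIXES = {
    'Korea': 'Republic of Korea',
    'Usa': 'United States of America',
    'Trinidad And Tobago': 'Trinidad and Tobago',
    'Kyrgyz Republic': 'Kyrgyzstan',
}


def standardize_countries(names):
    """Title-case the raw names, then replace whole values found in NAME_FIXES."""
    return names.str.title().replace(NAME_FIXES)


raw = pd.Series(['KOREA', 'USA', 'RUSSIA', 'TRINIDAD AND TOBAGO', 'KYRGYZ REPUBLIC'])
fixed = standardize_countries(raw)
print(fixed.tolist())
```

Because the replacement matches on whole values, names missing from the mapping (such as `Russia`) pass through unchanged.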


Now we add the three-letter ISO code (alpha-3) that we use for navigation on the website. We load the codes from the `iso-alpha-3.csv` file.

When we perform the merge, pandas reorders the rows. This is undesired, so we sort the rows back by rank. The ranks were read from the text file as strings, so we first convert the rank column to `int` to get a numeric rather than alphabetical sort.

In [13]:
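codes=pd.read_csv('iso-alpha-3.csv')
# Right join: keep every row of the results table, even if no code matches.
data_coded=pd.merge(codes,data,left_on='country', right_on='Country', how='right').drop('country', axis=1)
# Ranks were parsed as strings; cast to int so that the sort is numeric.
data_coded['Rank']=data_coded.Rank.astype(int)
data_coded=data_coded.sort_values('Rank')
# Swap the first two columns so that Rank comes before the code.
cols=data_coded.columns.tolist()
data_coded=data_coded[[cols[1],cols[0]]+cols[2:]]
data_coded.rename(columns={'code':'Code'}, inplace=True)
data_coded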

Unnamed: 0,Rank,Code,Country,# of Contestants,Total Score,Gold Awards,Silver Awards,Bronze Awards,Honorable Mentions
15,1,KOR,Republic of Korea,10,244,1,2,4,3
34,2,USA,United States of America,10,231,1,2,4,3
24,3,RUS,Russia,10,218,1,2,4,3
4,4,BRA,Brazil,10,179,1,2,4,3
31,5,THA,Thailand,10,175,1,2,4,3
13,6,JPN,Japan,10,174,1,2,4,3
6,7,CAN,Canada,10,167,1,2,4,3
29,8,TWN,Taiwan,10,155,1,2,4,3
1,9,AUS,Australia,10,149,1,2,4,3
18,10,MEX,Mexico,10,144,1,2,4,3
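A right join silently leaves the code empty for any country whose name does not match the codes table. One way to catch this is the `indicator` option of `pd.merge`, sketched below. The two small tables are toy stand-ins (the real `iso-alpha-3.csv` is assumed to have `country` and `code` columns, as in the cell above), and `Atlantis` is a deliberately unmatched, made-up entry.

```python
import pandas as pd

# Toy stand-ins for iso-alpha-3.csv and the results table (illustrative values).
codes = pd.DataFrame({'country': ['Brazil', 'Japan', 'Republic of Korea'],
                      'code': ['BRA', 'JPN', 'KOR']})
data = pd.DataFrame({'Rank': ['1', '4', '6', '10'],
                     'Country': ['Republic of Korea', 'Brazil', 'Japan', 'Atlantis']})

merged = pd.merge(codes, data, left_on='country', right_on='Country',
                  how='right', indicator=True)

# Rows tagged 'right_only' had no match in the codes table.
unmatched = merged.loc[merged['_merge'] == 'right_only', 'Country'].tolist()
print(unmatched)

# Cast the string ranks to int so that 10 sorts after 6 rather than before 4.
merged['Rank'] = merged['Rank'].astype(int)
merged = merged.sort_values('Rank', kind='stable').drop(columns=['country', '_merge'])
print(merged)
```

Any name listed in `unmatched` would need another renaming step before the table is saved. The stable sort keeps tied ranks in their original relative order.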


Now the information is exactly in the form that we need. We save the work.

In [14]:
data_coded.to_csv('reports/by_country_ranked_%s.csv' % year,index=False)
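Passing `index=False` keeps the shuffled pandas index (15, 34, 24, ...) out of the saved file. A quick round-trip check, sketched here on a two-row toy table and an in-memory buffer instead of the real report path, can confirm that the saved CSV reads back with the same columns and values.

```python
import io

import pandas as pd

# Toy table with a non-default index, mimicking the frame after the merge.
table = pd.DataFrame({'Rank': [1, 2],
                      'Code': ['KOR', 'USA'],
                      'Country': ['Republic of Korea', 'United States of America'],
                      'Total Score': [244, 231]},
                     index=[15, 34])

# Write to an in-memory buffer rather than a file, then read it back.
buffer = io.StringIO()
table.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

same_columns = reloaded.columns.tolist() == table.columns.tolist()
same_values = reloaded.values.tolist() == table.values.tolist()
print(same_columns, same_values)
```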