# DAV-5400
# Project 3
# Text Processing: Analyzing Chess Tournament Results
# Mark Kaplan, Jordan Armstrong, and Yitzhar Shalom


We are given [this file](https://raw.githubusercontent.com/yitzhar/DAV-5400/main/tournamentinfo.txt) containing information about chess tournaments. The task is to generate a CSV file containing the following information for each chess player:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre-Tournament Chess Rating of Opponents

To accomplish this, we need to load the data, reformat it, and extract and calculate the desired information, which is then passed into a new DataFrame and finally converted to a CSV file.

## Loading Data

In [8]:
#pandas and numpy are imported, as well as regex
import numpy as np 
import pandas as pd
import re

#each line is loaded as one string; skiprows=1 skips the file's first line
#the data is then displayed in its raw form
filename = 'https://raw.githubusercontent.com/yitzhar/DAV-5400/main/tournamentinfo.txt'
data = np.loadtxt(filename, delimiter=',', skiprows=1, dtype=str)
print(data)

[' Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| '
 ' Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | '
 '-----------------------------------------------------------------------------------------'
 '    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|'
 '   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |'
 '-----------------------------------------------------------------------------------------'
 '    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|'
 '   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |'
 '-----------------------------------------------------------------------------------------'
 '    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|'
 '   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B

## Cleaning Data

The first three rows of the loaded array are header rows and can be dropped. After that, the rows repeat in groups of three: line 1 of a player's data, line 2 of that player's data, and a dashed separator that carries no information. The first element of each line 1 (excluding the pair number) is the player's name, and the first element of each line 2 is the player's state. This pattern can be used to extract the necessary information.
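The three-row pattern can be illustrated on a small hand-made sample. This is only a minimal sketch: `sample_rows` is a hypothetical, shortened stand-in for the array after the header rows are removed, and splitting on the bar character is an assumption about the layout rather than the approach used in the cells that follow.

```python
# Hypothetical, shortened rows mimicking the layout after the header is removed
sample_rows = [
    "    1 | GARY HUA                        |6.0  |W  39|",
    "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
    "-" * 40,
    "    2 | DAKSHESH DARURI                 |6.0  |W  63|",
    "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |",
    "-" * 40,
]

# Every third row starting at index 0 is "line 1"; starting at index 1 is "line 2"
first_lines = sample_rows[0::3]
second_lines = sample_rows[1::3]

# The second bar-delimited field of line 1 holds the name,
# and the first field of line 2 holds the state
sample_names = [row.split("|")[1].strip() for row in first_lines]
sample_states = [row.split("|")[0].strip() for row in second_lines]

print(sample_names)
print(sample_states)
```

The step of 3 in the slices is what skips the separator rows.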

In [9]:
#remove unnecessary first three rows
data1 = data[3:] 
data1

array(['    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|',
       '   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |',
       '-----------------------------------------------------------------------------------------',
       '    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|',
       '   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |',
       '-----------------------------------------------------------------------------------------',
       '    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|',
       '   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |',
       '-----------------------------------------------------------------------------------------',
       '    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|',


In [10]:
#the data is converted from an array to a dataframe for ease of analysis
dafr = pd.DataFrame(data1)
dafr

Unnamed: 0,0
0,1 | GARY HUA |6.0 ...
1,ON | 15445895 / R: 1794 ->1817 |N:2 ...
2,----------------------------------------------...
3,2 | DAKSHESH DARURI |6.0 ...
4,MI | 14598900 / R: 1553 ->1663 |N:2 ...
...,...
187,MI | 15057092 / R: 1175 ->1125 | ...
188,----------------------------------------------...
189,64 | BEN LI |1.0 ...
190,MI | 15006561 / R: 1163 ->1112 | ...


## Data extraction

### Names and States

In [11]:
#group line1 and line2 into new df
line1 = dafr.iloc[0::3, :]
line2 = dafr.iloc[1::3, :]

#convert to string 
info1=line1.to_string()
info2=line2.to_string()

#use regex to extract names
names=re.findall('[A-Z]+[ ][A-Z]*[ ]*[A-Z]+', info1)
print(names)
#regex to extract states
states=re.findall('[A-Z]{2}', info2)
print(states)

['GARY HUA', 'DAKSHESH DARURI', 'ADITYA BAJAJ', 'PATRICK H SCHILLING', 'HANSHI ZUO', 'HANSEN SONG', 'GARY DEE SWATHELL', 'EZEKIEL HOUGHTON', 'STEFANO LEE', 'ANVIT RAO', 'CAMERON WILLIAM MC', 'KENNETH J TACK', 'TORRANCE HENRY JR', 'BRADLEY SHAW', 'ZACHARY JAMES HOUGHTON', 'MIKE NIKITIN', 'RONALD GRZEGORCZYK', 'DAVID SUNDEEN', 'DIPANKAR ROY', 'JASON ZHENG', 'DINH DANG BUI', 'EUGENE L MCCLURE', 'ALAN BUI', 'MICHAEL R ALDRICH', 'LOREN SCHWIEBERT', 'MAX ZHU', 'GAURAV GIDWANI', 'SOFIA ADINA STANESCU', 'CHIEDOZIE OKORIE', 'GEORGE AVERY JONES', 'RISHI SHETTY', 'JOSHUA PHILIP MATHEWS', 'JADE GE', 'MICHAEL JEFFERY THOMAS', 'JOSHUA DAVID LEE', 'SIDDHARTH JHA', 'AMIYATOSH PWNANANDAM', 'BRIAN LIU', 'JOEL R HENDON', 'FOREST ZHANG', 'KYLE WILLIAM MURPHY', 'JARED GE', 'ROBERT GLEN VASEY', 'JUSTIN D SCHILLING', 'DEREK YAN', 'JACOB ALEXANDER LAVALLEY', 'ERIC WRIGHT', 'DANIEL KHAIN', 'MICHAEL J MARTIN', 'SHIVAM JHA', 'TEJAS AYYAGARI', 'ETHAN GUO', 'JOSE C YBARRA', 'LARRY HODGE', 'ALEX KONG', 'MARISA RICC
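
The name pattern used in this cell matches at most three words, which is why one entry in the output appears as 'CAMERON WILLIAM MC'. The sketch below compares it with a pattern that accepts any number of space-separated words. `sample_line` is a hypothetical row and the four-word name in it is invented for illustration; whether the alternative pattern is safe on every row of the real file is an assumption that would need checking.

```python
import re

# Hypothetical row in the same layout as "line 1" of a player record
sample_line = "   25 | ANNA MARIA VAN DYKE             |3.5  |W  12|L   4|"

# Pattern from the cell: word, space, optional word, optional spaces, word
short = re.findall(r'[A-Z]+[ ][A-Z]*[ ]*[A-Z]+', sample_line)

# Alternative: one word followed by one or more single-space-separated words
full = re.findall(r'[A-Z]+(?: [A-Z]+)+', sample_line)

print(short)
print(full)
```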

### Total Points
Since this section was done by a different team member, the file was re-loaded and quickly re-cleaned.

In [12]:
import requests as req
import re

url = 'https://raw.githubusercontent.com/yitzhar/DAV-5400/main/tournamentinfo.txt'
res = req.get(url)
text = res.text

#save a local copy of the raw file
with open('tournamentinfo.txt', 'w') as file:
    file.write(text)

In [13]:
#flatten the file into a single line
data = text.replace('\n', ' ')

#remove the header by dropping the first 361 characters
no_header = data[361:]

#remove dashes, bars, and slashes
no_symbols = no_header.translate(str.maketrans('', '', '-|/'))

#collapse runs of whitespace into single spaces
clean_data = re.sub(r'\s+', ' ', no_symbols)

In [14]:
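# regex for finding total points: split on anything that is not a digit or a period,
# then keep the tokens that start with a digit-period-digit score such as 6.0
total_points = [a for a in re.split(r'[^0-9.]', clean_data) if re.match(r'\d\.\d', a)]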
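
As an alternative sketch, the score can also be read directly from each player's first line without flattening the file. The rows below are shortened copies of rows shown in the earlier output; the names `sample_rows` and `score_pattern`, and the pattern itself, are illustrative assumptions rather than part of the notebook.

```python
import re

# Shortened "line 1" rows in the tournament file's layout
sample_rows = [
    "    1 | GARY HUA                        |6.0  |W  39|W  21|",
    "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|",
]

# Look for a bar-delimited field that holds only digits, a period, and one digit
score_pattern = re.compile(r'\|\s*(\d+\.\d)\s*\|')
sample_points = [float(score_pattern.search(row).group(1)) for row in sample_rows]

print(sample_points)
```

Converting to `float` here is a design choice; the cell above keeps the scores as strings.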

### Pre-tournament Rating
Using several regular-expression steps, described below, the pre-tournament rating is extracted.

In [15]:
# split on the > character; each piece before a > ends with a pre-rating
split = re.split('>', clean_data)

# remove the character P and any following numbers and add to split_sub list
split_sub = []

for s in split:
    i = re.sub(r'P+\d+', '', s)
    split_sub.append(i)

# take the last five characters of each piece, replace any non-digit characters with nothing, and add to the list pre_rating_sub
pre_rating_sub = []
for s in split_sub:
    x = s[-5:]
    y = re.sub(r'\D', '', x)
    pre_rating_sub.append(y)

# the final piece follows the last > and holds no pre-rating, so drop it
pre_rating_sub2 = pre_rating_sub[:-1]

# convert the strings to integer values
pre_rating = [int(p) for p in pre_rating_sub2]
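
The same value can also be captured straight from the unmodified second line of each record. The sketch below assumes that the pre-rating always follows "R:" and may carry a suffix made of P and digits; `sample_lines` is hypothetical, and its second entry is invented to show such a suffix.

```python
import re

# First entry is copied from the output above; the second is an invented example
sample_lines = [
    "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
    "   MI | 00000000 / R:  955P11->1000P18  |N:4  |B    |",
]

# Capture the digits right after "R:"; anything after them (such as P11) is ignored
rating_pattern = re.compile(r'R:\s*(\d+)')
sample_pre = [int(rating_pattern.search(line).group(1)) for line in sample_lines]

print(sample_pre)
```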

### Average Opponent Pre-tournament Rating 


#### Opponent Pre-tournament List
First we need to create a list of the opponents' pre-tournament ratings. We can accomplish this using the pre_rating list from the previous step.

In [16]:
# First we need the round results for each player: split on runs of two or more capital letters
# and keep the pieces that start with a total-points value
opp_pre = [a for a in re.split(r'[A-Z]{2,}', clean_data) if re.match(r'\s+[0-9]+\.', a)]

In [17]:
# Now we have to get rid of the spaces in the list
opp_pre_2 = []

for a in opp_pre:
    c = re.sub(r'\s', '', a)
    opp_pre_2.append(c)

In [18]:
#Pull apart opponent ids
opp_pre_3 = []
for t in opp_pre_2:
    f = re.split('[A-Z]', t)
    opp_pre_3.append(f)

In [19]:
#Get rid of the first element of each entry in opp_pre_3 (the total points), leaving only the opponent ids
opp_pre_4 = []
for t in opp_pre_3:
    opp_pre_4.append(t[1:])

In [20]:
#Find how many non-empty rounds we have
number = []
for a in opp_pre_4:
    f = 0
    for j in a:
        if j != '':
            f = f + 1
    number.append(f)


In [21]:
#create a flat list of opponent ids, converting strings to int values and marking empty rounds as None
opp_pre_5 = []

for i in opp_pre_4:
    for j in i:
        if j == '':
            a= None
            opp_pre_5.append(a)
        else:
            a = int(j)
            opp_pre_5.append(a)


In [22]:
#Finally we look up the pre-rating of each opponent (pair number n sits at index n-1)
opp_scores = []
for t in opp_pre_5:
    if t is not None:
        opp_scores.append(pre_rating[t-1])
    else:
        opp_scores.append(None)

opp_scores
opp_pre_5



[39,
 21,
 18,
 14,
 7,
 12,
 4,
 63,
 58,
 4,
 17,
 16,
 20,
 7,
 8,
 61,
 25,
 21,
 11,
 13,
 12,
 23,
 28,
 2,
 26,
 5,
 19,
 1,
 45,
 37,
 12,
 13,
 4,
 14,
 17,
 34,
 29,
 11,
 35,
 10,
 27,
 21,
 57,
 46,
 13,
 11,
 1,
 9,
 2,
 3,
 32,
 14,
 9,
 47,
 28,
 19,
 25,
 18,
 59,
 8,
 26,
 7,
 20,
 16,
 19,
 55,
 31,
 6,
 25,
 18,
 38,
 56,
 6,
 7,
 3,
 34,
 26,
 42,
 33,
 5,
 38,
 None,
 1,
 3,
 36,
 27,
 7,
 5,
 33,
 3,
 32,
 54,
 44,
 8,
 1,
 27,
 5,
 31,
 19,
 16,
 30,
 22,
 54,
 33,
 38,
 10,
 15,
 None,
 39,
 2,
 36,
 None,
 48,
 41,
 26,
 2,
 23,
 22,
 5,
 47,
 9,
 1,
 32,
 19,
 38,
 10,
 15,
 10,
 52,
 28,
 18,
 4,
 8,
 40,
 49,
 23,
 41,
 28,
 2,
 9,
 43,
 1,
 47,
 3,
 40,
 39,
 6,
 64,
 52,
 28,
 15,
 None,
 17,
 40,
 4,
 43,
 20,
 58,
 17,
 37,
 46,
 28,
 47,
 43,
 25,
 60,
 44,
 39,
 9,
 53,
 3,
 24,
 34,
 10,
 47,
 49,
 40,
 17,
 4,
 9,
 32,
 11,
 51,
 13,
 46,
 37,
 14,
 6,
 None,
 24,
 4,
 22,
 19,
 20,
 8,
 36,
 50,
 6,
 38,
 34,
 52,
 48,
 None,
 52,
 64,
 15,
 55,
 31
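
Cells 16 through 22 can also be condensed into one pass per player row. The sketch below is an alternative, not the notebook's method: it assumes the round results are the bar-delimited fields after the total points, and that a field without digits means no opponent. The function name `opponent_ratings` and all sample data are hypothetical.

```python
import re

# Hypothetical pre-ratings keyed by pair number
sample_ratings = {1: 1794, 2: 1553, 3: 1384}

# Hypothetical "line 1" rows; "H" marks a round with no opponent
sample_rows = [
    "    1 | PLAYER ONE    |2.0  |W   2|W   3|H    |",
    "    2 | PLAYER TWO    |1.0  |L   1|D   3|D   3|",
]

def opponent_ratings(row, ratings):
    """Return the pre-rating of each opponent listed in a line-1 row."""
    # Fields 0-2 are pair number, name, and total points; the rest are rounds
    round_fields = row.split("|")[3:]
    opponent_ids = re.findall(r"\d+", "|".join(round_fields))
    return [ratings[int(i)] for i in opponent_ids]

for row in sample_rows:
    print(opponent_ratings(row, sample_ratings))
```

Keying the ratings by pair number avoids the `t-1` index arithmetic, at the cost of building a dictionary first.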

#### Getting the Average Opponent Pre-tourney rating
To find the average pre-rating of each player's opponents, we group the values into sevens (one score per round) and replace the null values with zero. The average is then the sum of each group divided by the number of values in the group, which is always seven. As a result, a player with unplayed rounds gets an understated average, as the value 169 in the output below shows.

In [23]:
#First we must split the opponent scores into chunks of seven
split = [opp_scores[x:x+7] for x in range(0, len(opp_scores), 7)]

#sub None for 0
split_sub = []

for f in split:
    for j in f:
        if j is None:
            split_sub.append(0)
        else:
            split_sub.append(j)

#group into seven
split_sub_grouped = [split_sub[x:x+7] for x in range(0, len(split_sub), 7)]

#Get the average (len(i) is 7 for every player, including those with unplayed rounds)
opp_score_avg = []
for i in split_sub_grouped:
    avg = sum(i)/len(i)
    opp_score_avg.append(round(avg))

opp_score_avg

[1605,
 1469,
 1564,
 1574,
 1501,
 1519,
 1372,
 1468,
 1523,
 1554,
 1468,
 1291,
 1498,
 1515,
 1484,
 990,
 1499,
 1480,
 1426,
 1411,
 1470,
 1115,
 1214,
 1357,
 1363,
 1507,
 1047,
 1522,
 1126,
 1144,
 1260,
 1379,
 1277,
 1375,
 1150,
 1190,
 989,
 1319,
 1430,
 1391,
 713,
 1150,
 1107,
 1137,
 1152,
 1358,
 1392,
 968,
 918,
 1111,
 1356,
 1495,
 577,
 1034,
 1205,
 1010,
 1168,
 1192,
 1131,
 950,
 1327,
 169,
 964,
 1263]
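
Because unplayed rounds are stored as zero and every group is divided by seven, a player with fewer than seven games gets a lowered average. One possible adjustment, sketched here with hypothetical sample values, is to average only over the rounds that have an opponent. The function name `average_played` is an assumption and does not appear in the notebook.

```python
# Hypothetical opponent ratings for two players; None marks an unplayed round
sample_scores = [
    [1436, 1563, 1600, 1610, 1649, 1663, 1716],
    [None, 1186, None, None, None, None, None],
]

def average_played(scores):
    """Average the ratings that are present, ignoring None entries."""
    played = [s for s in scores if s is not None]
    # Guard against a player with no recorded games
    return round(sum(played) / len(played)) if played else None

sample_avg = [average_played(s) for s in sample_scores]
print(sample_avg)
```

With this approach the second sample player's average equals the rating of the single opponent faced, rather than one seventh of it.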

## Creating the Data Frame 

In [24]:
#pass names into new df, add the states, points, and ratings as columns, and rename columns
new_df=pd.DataFrame(names)
new_df[1]=states
new_df[2] = total_points
new_df[3] = pre_rating
new_df[4] = opp_score_avg
new_df.columns=['Player Name', 'Player State', 'Total Number of Points', "Player's Pre-Rating", "Average Opponents Pre-Tournament Rating"]
new_df

| | Player Name | Player State | Total Number of Points | Player's Pre-Rating | Average Opponents Pre-Tournament Rating |
|---|---|---|---|---|---|
| 0 | GARY HUA | ON | 6.0 | 1794 | 1605 |
| 1 | DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 |
| 2 | ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 |
| 3 | PATRICK H SCHILLING | MI | 5.5 | 1716 | 1574 |
| 4 | HANSHI ZUO | MI | 5.5 | 1655 | 1501 |
| ... | ... | ... | ... | ... | ... |
| 59 | JULIA SHEN | MI | 1.5 | 967 | 950 |
| 60 | JEZZEL FARKAS | ON | 1.5 | 955 | 1327 |
| 61 | ASHWIN BALAJI | MI | 1.0 | 1530 | 169 |
| 62 | THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 964 |


## Exporting the Data Frame to a CSV
Finally, we export our Data Frame using its to_csv method.

In [25]:

new_df.to_csv(r'C:\Users\PC\Desktop\Project3.csv')
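
The path above only exists on one machine, and the call also writes the row index as an unnamed first column. A minimal sketch of a more portable call follows; `sample_df` is a hypothetical two-row stand-in for `new_df`, and the relative file name is an assumption.

```python
import pandas as pd

# Hypothetical two-row frame standing in for new_df
sample_df = pd.DataFrame({
    "Player Name": ["GARY HUA", "DAKSHESH DARURI"],
    "Player State": ["ON", "MI"],
})

# A relative path writes into the current working directory;
# index=False leaves the row-number column out of the file
sample_df.to_csv("Project3_sample.csv", index=False)

# Read the file back to confirm that only the named columns were written
print(pd.read_csv("Project3_sample.csv").columns.tolist())
```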