# All-Star Classifier
### Project Question: 
Using data scraped from the basketball reference from the 17-18 NBA season, can we accurately classify whether a player is an all-star?
### Goals:
- Scrape a valid and clean data frame from basketball reference
- Investigate each core variable, understand relationships, and check whether new variables can be created
- Compare across industry standard classification techniques, and tune an appropriate model for classification

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
from collections import OrderedDict
import requests
np.random.seed()

### Web Scraping 
Using the python package BeautifulSoup, we can scrape data directly from [www.basketballl-refence.com](www.basketballl-refence.com), and place the data into a pandas DataFrame.

In [2]:

def gather_data(url):
    season = []
    page_req = requests.get(url)
    soup = BeautifulSoup(page_req.text, 'lxml')#'lmxl' tells you how to parse 
    table = soup.find('table')
    table_body = table.find('tbody')
    rows =table_body.findAll('tr')# find all rows of data table
    season=[]
    fields = ['Player','Pos','Age','Tm','G','GS','MP','FG','FGA','FG%','3P','3PA','3P%','2P','2PA','2P%','eFG%','FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PS/G']
    #2P = 2 pointers made, 2PA = 2 pointers Attempted, @P% = 2-Pointer percentage etc.
    #Pos = position, TM = team, ORB = Offensive Rebound, DRB = Defensive Rebound, 
    #eFG% = effective Field Goal % (positively weights 3 pointers made)
    for row in rows:
        cell = row.findAll('td')
        if cell:
            cell = [c.text for c in cell]
            entry = OrderedDict(zip(fields, cell))
            season.append(entry)
    return pd.DataFrame(season)

In [3]:
df=gather_data('https://www.basketball-reference.com/leagues/NBA_2018_per_game.html')
df.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PS/G
0,Alex Abrines,SG,24,OKC,75,8,15.1,1.5,3.9,0.395,...,0.848,0.3,1.2,1.5,0.4,0.5,0.1,0.3,1.7,4.7
1,Quincy Acy,PF,27,BRK,70,8,19.4,1.9,5.2,0.356,...,0.817,0.6,3.1,3.7,0.8,0.5,0.4,0.9,2.1,5.9
2,Steven Adams,C,24,OKC,76,76,32.7,5.9,9.4,0.629,...,0.559,5.1,4.0,9.0,1.2,1.2,1.0,1.7,2.8,13.9
3,Bam Adebayo,C,20,MIA,69,19,19.8,2.5,4.9,0.512,...,0.721,1.7,3.8,5.5,1.5,0.5,0.6,1.0,2.0,6.9
4,Arron Afflalo,SG,32,ORL,53,3,12.9,1.2,3.1,0.401,...,0.846,0.1,1.2,1.2,0.6,0.1,0.2,0.4,1.1,3.4


### Data Cleaning
Some of the data must be adjusted in order for our Logistic Regression to work.

 - Players who have been traded, or played on multiple teams for any reason will have a row for their stats with each team, as well as a row with their season total. We are only interested in total row, and will remove the others from the DataFrame.

 - If a player has, for example, taken 0 3 pointers, their 3PFG% will be represented as an empty string "". We will change these to 0.

In [4]:
#players who have been traded have a row for each team, + a total row. We only want the total
def prepareDF(df):
    traded_players = list(df[df.Tm == 'TOT'].Player)
    traded=list(df[df.Tm=='TOT'].Player)
    indices=[]
    for i,row in df.iterrows():
        if row.Player in traded and row.Tm != 'TOT':
            indices.append(i)
            #print "ds"
    # reset indices
    newDF=df.drop(indices)
    newDF=newDF.reset_index(drop=True)
    newDF=newDF.replace("",0) #replaces empty stats with 0
    return newDF

In [5]:
originalData=prepareDF(df)

### Choosing Parameters
One of the benefits of creating many models on 2017-2018 season is we can find which combination of parameters is the most accurate. Here we drop the unwanted columns.

 - Drop Offensive Rebounds and Defensive Rebounds to to their collinearity with Rebounds.
 - Drop Player, Team, and Position because they are strings.
 - Drop Games, Games Started, and Age because their values did not seem to impact the accuracy beneficially. CHANGE MAYBE or Add Code

In [6]:
finalData=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','G','GS','Pos'],axis=1)  

### Create a List of All-Star Flags

 - Compile a list of the All-Stars for the 2017-2018 season manually. 
 - Create a list named _ASFlag_ indicating True or False foreach respective index of _orignalData_
 
This needs to be a separate list, and not part of the DataFrame for the Logistic Regression function to run properly

In [7]:
#create list of True/False to mark all-stars to use in training data
all_stars = [
    'LeBron James','Kevin Durant','Anthony Davis',
    'Kyrie Irving','DeMarcus Cousins','LaMarcus Aldridge',
    'Bradley Beal','Goran Dragic','Andre Drummond','Paul George',
    'Victor Oladipo','Kemba Walker','Russell Westbrook','Kevin Love',
    'Kristaps Porzingis','John Wall','Stephen Curry','James Harden',
    'DeMar DeRozan','Giannis Antetokounmpo','Joel Embiid','Jimmy Butler',
    'Draymond Green','Al Horford','Damian Lillard','Kyle Lowry',
    'Klay Thompson','Karl-Anthony Towns'
]
#want all stars as separate list for regression
ASFlag=[]
for name in originalData.Player:
    if name in all_stars:
        ASFlag.append(True)
    else:
        ASFlag.append(False)

### Logistic Regression 
Now that all of your data is prepared, we are ready to train a model.
 - Import the Logistic Regression function from scikit-learn
 - Create relevent helper functions to better describe and display data
 - Design a function to split _finalData_ into training sets and testing sets to train and test our model using an 80/20 split
 - Determine accuracy and viability of the model

In [8]:
from sklearn.linear_model import LogisticRegression

In [9]:
# declares an incorrect prediction as False Positive, if a non-Allstar is marked as an All-Star,
#or Missed AllStar, if an Allstar is classified as a non-Allstar
def resType(pred,data):
    if pred==True and data==False:
        return " - False Positive"
    else :
        return " - Missed AllStar"
#Allows the function orsted to sort a list of tuples by 2nd item of the tuple
def getKey(item):
    return item[1]

In [10]:
#@param df is the data with duplicates removed, and the appropriate columns dropped (player names, etc.)
#@param original_df is the data with duplicates removed returned from prepareDF
#@param displayOutcome determines whether or not thee function should print out descriptions of how the model performed 
def classify(df, original_df=originalData, displayOutcome=True):
    msk = np.random.rand(len(df)) < 0.8
    testing_flags = [ASFlag[i] for i in xrange(len(ASFlag)) if not msk[i]]
    numAllStars = sum(testing_flags)
    #training set should have at least 20 of the 28 all stars in it
    if numAllStars>7:
        classify(df,original_df,displayOutcome)
    train = df[msk]
    training_flags = [ASFlag[i] for i in xrange(len(ASFlag)) if msk[i]]
    test = df[~msk]
    clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(train, training_flags)      
    predictions = list(clf.predict(test))
    numCorrect=0
    #list of wrong tuples, in order of predicted then actual
    wrong=[]
    correctAS=[]
    for pred,data,i in zip(predictions,testing_flags,xrange(len(predictions))):
        #print pred, dafota
        if pred==data:
            numCorrect+=1
            #print i
            #rint count
            if pred==True:
                correctAS.append(i)
        else:
            wrong.append((i,pred,data))
    if displayOutcome==True:
        print "Number of Correct Predictions: ", numCorrect
        print "Total Predictions Made: ", len(predictions)
        print numCorrect*1.0/len(predictions), "\n"

        print "Correctly Predicted Allstars:\n"
        for indx in correctAS:
            ogIndex=test.index[indx]
            print original_df.iloc[ogIndex,0]
        print "\nWrong Predictions: \n"
        for i,pred,data in wrong:
            ogIndex=test.index[i]
            print original_df.iloc[ogIndex,0] + resType(pred,data)
    diagnostics=(numCorrect,len(predictions),numAllStars,len(correctAS),clf)
    return diagnostics

In [11]:
classify(finalData,originalData,displayOutcome=True)

Number of Correct Predictions:  109
Total Predictions Made:  113
0.964601769912 

Correctly Predicted Allstars:

DeMarcus Cousins
Joel Embiid
James Harden
Damian Lillard
Karl-Anthony Towns

Wrong Predictions: 

MarShon Brooks - False Positive
DeMar DeRozan - Missed AllStar
Kyle Lowry - Missed AllStar
Lou Williams - False Positive


(109,
 113,
 7,
 5,
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='multinomial',
           n_jobs=1, penalty='l2', random_state=0, solver='lbfgs',
           tol=0.0001, verbose=0, warm_start=False))

In [12]:
def testClassifier1(its,df,ogdf=originalData):
    correct=0
    total=0
    correctAllStars=0
    totAllStars=0

    for i in xrange(its):
        clf=classify(df,ogdf,displayOutcome=False)
        correct+=clf[0]
        total+=clf[1]
        totAllStars+=clf[2]
        correctAllStars+=clf[3]
    totAcc = 1.0*correct/total
    asAcc = 1.0*correctAllStars/totAllStars
    
    print "Over ", its, " iterations, the classifier predicted ",correct, "outcomes correctly out of ", total, "total data points for an overall accuracy of ", totAcc ,"."
    print "Of the ", totAllStars, " All-Stars that were classified, ", correctAllStars, " were correctly identified, for an accuracy of ", asAcc, "." 
testClassifier1(1000, finalData)

Over  1000  iterations, the classifier predicted  104605 outcomes correctly out of  108091 total data points for an overall accuracy of  0.967749396342 .
Of the  5622  All-Stars that were classified,  3492  were correctly identified, for an accuracy of  0.621131270011 .


### Making A More Relevant Model
During the Logistic Regression, likelihood values between 0 and 1 are assigned to each player of the testing data, and a likelihood greater than 0.5 is classified as an All-Star. However, if we know there are 6 All-Stars in the testing data, it would be more relevant to find the 6 players most likely to be all-stars,

 - Adjust classifying function to return a more relevant result

In [13]:
def classifyTopX(df, original_df=originalData,testing=False):
    msk = np.random.rand(len(df)) < 0.8
    testing_flags = [ASFlag[i] for i in xrange(len(ASFlag)) if not msk[i]]
    numAllStars = sum(testing_flags) #number of allstars we want (we'll find the best numAllStars)
    #training set should have at least 20 of the 28 all stars in it
    if numAllStars>8:
        return 0,0,0,""
    else:
        if not testing:
            print "We need to predict " + str(numAllStars) + " more All_Stars."
        train = df[msk]
        training_flags = [ASFlag[i] for i in xrange(len(ASFlag)) if msk[i]]
        test = df[~msk]
        clf2 = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(train, training_flags)  
        realAllStars=[original_df.iloc[test.index[i],0] for i in xrange(len(test)) if original_df.iloc[test.index[i],0] in all_stars]
        if not testing:
            print "Correct Answers are:"
            for each in realAllStars:
                print each
       
        likelihoods=clf2.predict_proba(test)
        #print likelihoods
        posLikelihood=[(likelihoods[i][1],i) for i in xrange(len(likelihoods))]
        top=sorted(posLikelihood,reverse=True)
        t2=[list((original_df.iloc[test.index[indx],0],likelihood)) for likelihood,indx in top]    
        resultsDF=pd.DataFrame(t2,columns=["Name","Likelihood"]).head(numAllStars+5)      
        predictions = [t2[i][0] for i in xrange(numAllStars)]
        #print predictions
        numCorrect=0
        correctAS=[]
        wrong=[]
        #list of wrong tuples, in order of predicted then actual
        for i in xrange(numAllStars):
            if predictions[i] in realAllStars:
                numCorrect+=1
                correctAS.append(predictions[i])
            else:
                wrong.append(predictions[i])

        if not testing:
            print "We predicted " +str(numCorrect) +" correctly out of " + str(numAllStars)+"."
            print "Our "+ str(numAllStars) + " predictions, with the next 5 best predictions are outputted below:"
    if testing==True:
        return [resultsDF,numCorrect,numAllStars,wrong,clf2]
    #return the classifier so we can use it to predict next season
    return  resultsDF,clf2


In [14]:
clf2=classifyTopX(finalData, testing=False)
clf2[0]

We need to predict 2 more All_Stars.
Correct Answers are:
James Harden
Russell Westbrook
We predicted 2 correctly out of 2.
Our 2 predictions, with the next 5 best predictions are outputted below:


Unnamed: 0,Name,Likelihood
0,James Harden,0.986231
1,Russell Westbrook,0.894135
2,Chris Paul,0.785062
3,Tyreke Evans,0.48255
4,Andre Ingram,0.325736
5,Will Barton,0.266307
6,Hassan Whiteside,0.25182


In [15]:
#to allow sorting by number of false allstars
def sum_repeats(l):
    if len(l)==0:
        return 0
    d=[]
    for entry in l:
        try:
            index=[each[0]for each in d].index(entry)
            d[index][1]+=1
        except ValueError:
            d.append([entry,1])
    return d

In [16]:
def testClassifier2(its,df,ogdf=originalData,output=True):
    correct=0
    total=0
    compile_wrongs=[]
    for i in xrange(its):
        clf=classifyTopX(df,ogdf,testing=True)
        correct+=clf[1]
        total+=clf[2]
        if len(clf[3])>0:
            for each in clf[3]:
                compile_wrongs.append(each)
    totAcc = 1.0*correct/total
    
    print "Over " + str(its) + " iterations, the classifier predicted " + str(correct) + " All-Stars correctly out of " +str(total) + " total attempts." 
    print "The classifier had an overall accuracy of " + str(totAcc) + "."
    if output==True:
        return compile_wrongs
sorted(sum_repeats(testClassifier2(1000, finalData)),key=getKey,reverse=True)

Over 1000 iterations, the classifier predicted 3209 All-Stars correctly out of 4648 total attempts.
The classifier had an overall accuracy of 0.690404475043.


[[u'Chris Paul', 183],
 [u'Nikola Jokic', 167],
 [u'Blake Griffin', 166],
 [u'Devin Booker', 151],
 [u'Lou Williams', 124],
 [u'MarShon Brooks', 120],
 [u'Marc Gasol', 78],
 [u'Ben Simmons', 65],
 [u'Jrue Holiday', 59],
 [u'Tyreke Evans', 59],
 [u'CJ McCollum', 36],
 [u'Nikola Vucevic', 36],
 [u'Aaron Gordon', 34],
 [u'Rudy Gobert', 32],
 [u'Dwight Howard', 29],
 [u'Khris Middleton', 13],
 [u'Nikola Mirotic', 12],
 [u'Tobias Harris', 12],
 [u'Kawhi Leonard', 8],
 [u'Harrison Barnes', 7],
 [u'Clint Capela', 6],
 [u'Andre Ingram', 6],
 [u'Eric Gordon', 5],
 [u'Hassan Whiteside', 5],
 [u'Donovan Mitchell', 4],
 [u'Lauri Markkanen', 4],
 [u'DeAndre Jordan', 4],
 [u'Mike Conley', 3],
 [u'Eric Bledsoe', 2],
 [u'Lonzo Ball', 2],
 [u'Gary Harris', 1],
 [u'Will Barton', 1],
 [u'Dennis Schroder', 1],
 [u'Danilo Gallinari', 1],
 [u'J.J. Redick', 1],
 [u'Tim Hardaway', 1],
 [u'T.J. Warren', 1]]

### Common False Positives:
Jokic -  18.5 PTS / 10.7 REB /6.1 AST

Chris Paul - 18.6 / 5.4 / 7.9 

Marc Gasol - 17.3 / 8.1 / 4.2

Ben Simmons - 15.8 / 8.1 / 8.2

Devin Booker - 24.9 / 4.5 / 4.7

Lou Williams - 22.6 / 2.5 / 5.3


### Performance

 - The first classifier prodicts All-Stars correctly with roughly 61% accuracy. When it knows how many All-Stars it is looking for, this increases to around 70% accuracy.
 - The common False Positives that repeatedly outputted definitely had All-Star numbers, and all had strong cases to make the team last year, with the possible exceptions of Marshon Brooks and Tyreke Evans.
 - Overall, though not perfect, the calssifier is working as intended.


### Potential Adjustments

We are now ready to use our model to predict the All-Stars for the 2018-2019 season, however, if there is room to improve our model's accuracy we must look into it.
 - Try different combinations of parameters. Previously we dropped Minutes Played, Offensive Rebounds, Defensive Rebounds, Age, Games, and Games Started from consideration
 
### Investigation into Games as a viable predictor

In [17]:
#Adjustment  1 -- Keeping games and games started -- ~72.5 Accuracy
dataAdjustment1=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos'],axis=1)  
testClassifier2(1000, dataAdjustment1,output=False)

Over 1000 iterations, the classifier predicted 3376 All-Stars correctly out of 4697 total attempts.
The classifier had an overall accuracy of 0.718756653183.


In [18]:
#Adjustment  2 -- Replacing Games and Games Started with startedPercent -- ~70% Accuracy
gs=[int(i) for i in originalData['GS']]
g=[int(i) for i in originalData['G']]

startPercent=[gs[i]*1.0/g[i] for i in xrange(len(g))]
originalData['startPercent']=startPercent
dataAdjustment2=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','G','GS'],axis=1)  
testClassifier2(1000, dataAdjustment2,output=False)

#Reset originalData to its initial state (without startPercent)
originalData=prepareDF(df)

Over 1000 iterations, the classifier predicted 3365 All-Stars correctly out of 4736 total attempts.
The classifier had an overall accuracy of 0.710515202703.


In [19]:
#Adjustment  3 --Just Games Started -- ~71% Accuracy

dataAdjustment3=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','G',],axis=1)  
testClassifier2(1000, dataAdjustment3,output=False)

Over 1000 iterations, the classifier predicted 3355 All-Stars correctly out of 4629 total attempts.
The classifier had an overall accuracy of 0.724778569886.


In [20]:
#Adjustment  4 --Just Games -- ~68% Accuracy

dataAdjustment4=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','GS',],axis=1)  
testClassifier2(1000, dataAdjustment4,output=False)

Over 1000 iterations, the classifier predicted 3341 All-Stars correctly out of 4797 total attempts.
The classifier had an overall accuracy of 0.69647696477.


### Preliminary Results

It seems as though keeping Games and Games Started in our model is the best option. As the very least, we want to keep Games Started.

| Model          | Accuracy |  
|:----------------:|:----------:|
| Games          | 68%        | 
| startedPercent | 70%        |  
| Neither        | 70%       |   
| Games Started        | 71%       |  
| Games and Games Started        | 72.5%       |   

### Field Goal% vs Attempts and Made

Data for Field Goal Percent is included in attempts and makes, so removing it shouldnn't make a difference. Does changing these parameters improve the model?

In [21]:
#Adjustment  5 -- Removing 2P% -- 72.7%

dataAdjustment5=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','2P%'],axis=1)  
testClassifier2(1000, dataAdjustment5,output=False)

Over 1000 iterations, the classifier predicted 3548 All-Stars correctly out of 4942 total attempts.
The classifier had an overall accuracy of 0.717927964387.


In [22]:
#Adjustment  6 -- Removing 2PA and 2P

dataAdjustment6=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','2PA','2P'],axis=1)  
testClassifier2(1000, dataAdjustment6,output=False)

Over 1000 iterations, the classifier predicted 3336 All-Stars correctly out of 4645 total attempts.
The classifier had an overall accuracy of 0.718191603875.


In [23]:
#Adjustment  7 -- Only Keeping 2PM -- 

dataAdjustment7=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','2PA','2P%'],axis=1)  
testClassifier2(1000, dataAdjustment7,output=False)

Over 1000 iterations, the classifier predicted 3377 All-Stars correctly out of 4662 total attempts.
The classifier had an overall accuracy of 0.724367224367.


In [39]:
#Adjustment  7_b -- no 2P data -- 

dataAdjustment7_b=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','2PA','2P','2P%'],axis=1)  
testClassifier2(1000, dataAdjustment7_b,output=False)

Over 1000 iterations, the classifier predicted 3465 All-Stars correctly out of 4841 total attempts.
The classifier had an overall accuracy of 0.715761206362.


### Takeaways

 - Removing 2P% has no significant effect on the model, as expected
 - Removing 2PA and 2P produces similar results to removing 2P%, and both are marginally better than competely ignoring 2-pointer data
 - Keeping just 2P is the only model that doesn't see a decrease in accuracy
 
We will proceed with Adjustment 7 as our benchmark model

### Repeat this process with Free Throws, and 3-Pointers, comparing to a baseline of ~72.5%

In [45]:
#Adjustments 8 -10 -- Free Throws and 3-Pointers
print "Removing FT%\n"
dataAdjustment8=originalData.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','2PA','FT%'],axis=1)  
testClassifier2(1000, dataAdjustment8,output=False)

print "Removing FT, FTA\n"
dataAdjustment9=originalData.drop(['Player','Tm', 'MP','ORB','DRB','Age','Pos','2P%','2PA','FT','FTA'],axis=1)  
testClassifier2(1000, dataAdjustment9,output=False)

print "Removing FT, FTA, FT%'\n"
dataAdjustment10=originalData.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','2PA','FT','FTA'],axis=1)  
testClassifier2(1000, dataAdjustment10,output=False)

Removing FT%
Over 1000 iterations, the classifier predicted 3377 All-Stars correctly out of 4736 total attempts.
The classifier had an overall accuracy of 0.713048986486.
Removing FT, FTA
Over 1000 iterations, the classifier predicted 3511 All-Stars correctly out of 4753 total attempts.
The classifier had an overall accuracy of 0.73869135283.
Removing FT, FTA, FT%
Over 1000 iterations, the classifier predicted 3516 All-Stars correctly out of 4804 total attempts.
The classifier had an overall accuracy of 0.73189009159.


- The discrepancies are all very slight, but either removing FT and FTA in favor of only FT% is the most accurate model (73.8%) by a small margin
- Removing all free throw related stats also improves out model
- Moving forward we will remove FT and FTA in favor of just utilizing FT%

In [47]:
#Adjustments 11-13 -- Free Throws and 3-Pointers
print "Removing 3P% 3PA\n"
dataAdjustment11=originalData.drop(['Player','Tm', 'MP','ORB','DRB','Age','Pos','2P%','2PA','FT','FTA','3P%', '3PA'],axis=1)  
testClassifier2(1000, dataAdjustment11,output=False)

print "Removing 3P, 3PA\n"
dataAdjustment12=originalData.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','2PA','FT','FTA','3P','3PA'],axis=1)  
testClassifier2(1000, dataAdjustment12,output=False)

print "Removing 3P, 3PA, 3P%\n"
dataAdjustment13=originalData.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','2PA','FT','FTA','3P','3PA','3P%'],axis=1)  
testClassifier2(1000, dataAdjustment13,output=False)



Removing 3P% 3PA
Over 1000 iterations, the classifier predicted 3453 All-Stars correctly out of 4708 total attempts.
The classifier had an overall accuracy of 0.733432455395.
Removing 3P, 3PA
Over 1000 iterations, the classifier predicted 3450 All-Stars correctly out of 4713 total attempts.
The classifier had an overall accuracy of 0.732017823043.
Removing 3P, 3PA, 3P%
Over 1000 iterations, the classifier predicted 3581 All-Stars correctly out of 4785 total attempts.
The classifier had an overall accuracy of 0.748380355277.


 - Removing all statistics on 3-pointers raises our accuracy to ~74.8%!
     - If we created a separate model just for predicting guards, this would probably not be the case
 - Adjustment 13 will now be our benchmark model
 
 ## Minutes Played and Personal Fouls

In [48]:
print "Keeping MP\n"
dataAdjustment14=originalData.drop(['Player','Tm', 'ORB','DRB','Age','Pos','2P%','2PA','FT','FTA','3P','3PA','3P%'],axis=1)  
testClassifier2(1000, dataAdjustment14,output=False)

Keeping MP

Over 1000 iterations, the classifier predicted 3414 All-Stars correctly out of 4734 total attempts.
The classifier had an overall accuracy of 0.721166032953.


 - This model is over 2% less accurate, so we will continue ignoring minutes played

In [53]:
print "Removing Fouls\n"
dataAdjustment15=originalData.drop(['Player','Tm', 'MP','ORB','DRB','Age','Pos','2P%','2PA','FT','FTA','3P','3PA','3P%','PF'],axis=1)  
testClassifier2(1000, dataAdjustment15,output=False)

Removing Fouls

Over 1000 iterations, the classifier predicted 3455 All-Stars correctly out of 4724 total attempts.
The classifier had an overall accuracy of 0.731371718882.


 - This model is quite good, but overall slightly less accurate than Adjustment 13

In [55]:
print "Removing eFG%\n"
dataAdjustment16=originalData.drop(['Player','Tm', 'MP','ORB','DRB','Age','Pos','2P%','2PA','FT','FTA','3P','3PA','3P%','PF','eFG%'],axis=1)  
testClassifier2(1000, dataAdjustment16,output=False)

Removing eFG%

Over 1000 iterations, the classifier predicted 3318 All-Stars correctly out of 4620 total attempts.
The classifier had an overall accuracy of 0.718181818182.


 - Removing effective Field Goal Percentage definitely did not improve our model.
 - There are hundreds of possible combinations of parameters we could check, but these seemed like the most reasonable to test and compare
 - We will proceed by using dataAdjustment13 as the data to train our classifier with
 

## Predicting the 2018-2019 All-Stars
 - Gather data for the 2018-2019 season (up end of day 1/23/2019)
 - We will now use all of the data from 2017-2018 to train the classifer instead of splitting it 80/20
 - insert the 2018-2019 season data as the testing data, and predict the All-Star teams for this year

In [58]:
#get 2018-2019 data 
data2019=gather_data('https://www.basketball-reference.com/leagues/NBA_2019_per_game.html')

In [59]:
og2019=prepareDF(data2019)

In [60]:
cleaned_2019=og2019.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','2PA','FT','FTA','3P','3PA','3P%'],axis=1)

In [61]:
clf_final = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(dataAdjustment13, ASFlag)  
probs=clf_final.predict_proba(cleaned_2019)
posLikelihood_2019=[(probs[i][1],i) for i in xrange(len(probs))]
sortedLikelihoods=sorted(posLikelihood_2019,reverse=True)
playerWeights = [list((og2019.iloc[cleaned_2019.index[indx],0],likelihood)) for likelihood,indx in sortedLikelihoods]
df2019=pd.DataFrame(playerWeights)
df2019.head(24)

Unnamed: 0,0,1
0,James Harden,0.991096
1,Anthony Davis,0.971575
2,Stephen Curry,0.947746
3,Kevin Durant,0.928463
4,LeBron James,0.923051
5,Joel Embiid,0.914036
6,Giannis Antetokounmpo,0.875587
7,Damian Lillard,0.864438
8,Kawhi Leonard,0.845124
9,Paul George,0.824542


### Future Work
 1. Tweeking the Model
     - Creating a seperate model for guards and the frontcourt
     - Creating a seperate model for East vs. West
     - Using a different combination of parameters, including advanced stats such as Player Efficiency Rating (PER), or Win Shares
     - Creating our own statistics
          - For example, I created a statistic for games Started Percentage by dividing Games started by total Games, but the model became less accurate.
 2. Exploring other ways to make a classifier
     - Support Vector Machines
     - Random Forest
     - XGBoost
     - Cross-Validation with 99/1 train/test splits instead of 80/20
     