# All-Star Classifier
### Project Question: 
Using data scraped from Basketball Reference for the 2017-2018 NBA season, can we accurately classify whether a player was an All-Star that season?
### Goals:
- Scrape a valid and clean data frame from Basketball Reference
- Investigate each core variable, understand relationships, and check whether new variables can be created
- Create the best possible model, and use that to predict the All-Stars for the 2018-2019 season

In [2]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
from collections import OrderedDict
import requests
np.random.seed()

### Web Scraping 
Using the Python package BeautifulSoup, we can scrape data directly from [www.basketball-reference.com](https://www.basketball-reference.com) and place the data into a pandas DataFrame.

In [3]:
def gather_data(url):
    page_req = requests.get(url)
    soup = BeautifulSoup(page_req.text, 'lxml')  # 'lxml' selects the parser used to read the HTML
    table = soup.find('table')
    table_body = table.find('tbody')
    rows = table_body.findAll('tr')  # find all rows of the data table
    season = []
    fields = ['Player','Pos','Age','Tm','G','GS','MP','FG','FGA','FG%','3P','3PA','3P%','2P','2PA','2P%','eFG%','FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PS/G']
    #2P = 2 pointers made, 2PA = 2 pointers attempted, 2P% = 2 pointer percentage
    #Pos = position, Tm = team, ORB = offensive rebounds, DRB = defensive rebounds, 
    #eFG% = effective field goal percentage
    for row in rows:
        cell = row.findAll('td')
        if cell:
            cell = [c.text for c in cell]
            entry = OrderedDict(zip(fields, cell))
            season.append(entry)
    return pd.DataFrame(season)

In [4]:
df=gather_data('https://www.basketball-reference.com/leagues/NBA_2018_per_game.html')
df.head()

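| | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PS/G |
|---:|:---|:---|---:|:---|---:|---:|---:|---:|---:|---:|:---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | Alex Abrines | SG | 24 | OKC | 75 | 8 | 15.1 | 1.5 | 3.9 | 0.395 | ... | 0.848 | 0.3 | 1.2 | 1.5 | 0.4 | 0.5 | 0.1 | 0.3 | 1.7 | 4.7 |
| 1 | Quincy Acy | PF | 27 | BRK | 70 | 8 | 19.4 | 1.9 | 5.2 | 0.356 | ... | 0.817 | 0.6 | 3.1 | 3.7 | 0.8 | 0.5 | 0.4 | 0.9 | 2.1 | 5.9 |
| 2 | Steven Adams | C | 24 | OKC | 76 | 76 | 32.7 | 5.9 | 9.4 | 0.629 | ... | 0.559 | 5.1 | 4.0 | 9.0 | 1.2 | 1.2 | 1.0 | 1.7 | 2.8 | 13.9 |
| 3 | Bam Adebayo | C | 20 | MIA | 69 | 19 | 19.8 | 2.5 | 4.9 | 0.512 | ... | 0.721 | 1.7 | 3.8 | 5.5 | 1.5 | 0.5 | 0.6 | 1.0 | 2.0 | 6.9 |
| 4 | Arron Afflalo | SG | 32 | ORL | 53 | 3 | 12.9 | 1.2 | 3.1 | 0.401 | ... | 0.846 | 0.1 | 1.2 | 1.2 | 0.6 | 0.1 | 0.2 | 0.4 | 1.1 | 3.4 |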


### Data Cleaning
Some of the data must be adjusted in order for our logistic regression to work.

 - Players who have played on multiple teams during the season will have an additional row for each team they played on, as well as a row with their season total. We are only interested in their total data, and will remove the rest
 - If a player has never taken a 3 pointer, their 3P% will be an empty string "". The logistic regression cannot handle this, so we will replace all empty strings with 0.
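
The two rules above can be sketched on plain dictionaries, without pandas, to make the row-selection logic explicit. The function name `keep_season_totals` and the toy rows are illustrative assumptions and are not part of the notebook.

```python
def keep_season_totals(rows):
    """Keep one row per player: the 'TOT' row for traded players, the only row otherwise."""
    # Players with a 'TOT' row appeared for more than one team
    traded = {row["Player"] for row in rows if row["Tm"] == "TOT"}
    cleaned = []
    for row in rows:
        # Skip the team-specific rows of traded players
        if row["Player"] in traded and row["Tm"] != "TOT":
            continue
        # Replace empty stats (e.g. 3P% with no attempts) with 0
        cleaned.append({key: (0 if value == "" else value) for key, value in row.items()})
    return cleaned


# Hypothetical rows in the shape the scraper produces (all values are strings)
toy_rows = [
    {"Player": "Player A", "Tm": "TOT", "3P%": ".350"},
    {"Player": "Player A", "Tm": "BOS", "3P%": ".400"},
    {"Player": "Player A", "Tm": "LAL", "3P%": ".300"},
    {"Player": "Player B", "Tm": "MIA", "3P%": ""},
]
cleaned_rows = keep_season_totals(toy_rows)
```

Note that replacing a missing percentage with 0 treats "never attempted a shot" the same as "missed every attempt". That is a modelling choice rather than a neutral default.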

In [5]:
def prepareDF(df):
    #Create a list of player names with a total row, indicating that they played on multiple teams
    traded=list(df[df.Tm=='TOT'].Player)
    indices=[]
    #If this player played on multiple teams, find row indices containing team-specific data
    for i,row in df.iterrows():
        if row.Player in traded and row.Tm != 'TOT':
            indices.append(i)
    #Remove the irrelevant rows from the data frame
    newDF=df.drop(indices)
    
    newDF=newDF.reset_index(drop=True)
    newDF=newDF.replace("",0) #replaces empty stats with 0
    return newDF

In [6]:
originalData=prepareDF(df)

### Choosing Parameters
One of the benefits of creating many models for the 2017-2018 season is we can find which combination of parameters is the most accurate. Here we drop the unwanted parameters.

 - Drop offensive rebounds and defensive rebounds due to their collinearity with rebounds.
 - Drop player, team, and position because they are strings.
 - Drop games, games started, and age for now. We will investigate games and games started further below.
 - Drop minutes played as well; it is revisited later.

In [7]:
finalData=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','G','GS','Pos'],axis=1)  

### Create a List of All-Star Flags

 - Compile a list of the All-Stars for the 2017-2018 season manually. 
 - Create a list named _ASFlag_ indicating True or False for each respective index of _originalData_
 
This needs to be a separate list, and not part of the DataFrame for the Logistic Regression function to run properly.

Note that there were 28 All-Stars last year instead of the standard 24 due to injuries.

In [8]:
#create list of True/False to mark all-stars to use in training data
all_stars = [
    'LeBron James','Kevin Durant','Anthony Davis',
    'Kyrie Irving','DeMarcus Cousins','LaMarcus Aldridge',
    'Bradley Beal','Goran Dragic','Andre Drummond','Paul George',
    'Victor Oladipo','Kemba Walker','Russell Westbrook','Kevin Love',
    'Kristaps Porzingis','John Wall','Stephen Curry','James Harden',
    'DeMar DeRozan','Giannis Antetokounmpo','Joel Embiid','Jimmy Butler',
    'Draymond Green','Al Horford','Damian Lillard','Kyle Lowry',
    'Klay Thompson','Karl-Anthony Towns'
]
#want All-Stars as separate list for regression
ASFlag=[]
for name in originalData.Player:
    if name in all_stars:
        ASFlag.append(True)
    else:
        ASFlag.append(False)

### Logistic Regression 
Now that all of our data is prepared, we are ready to train a model.
 - Import the Logistic Regression function from scikit-learn
 - Create relevant helper functions to better describe and display data
 - Design a function to split _finalData_ into training sets and testing sets to train and test our model using an 80/20 split
 - Determine accuracy and viability of the model
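
Because only 28 of the players are All-Stars, a purely random 80/20 split can occasionally push too many of them into the testing set. The sketch below shows one way to re-draw the split until the testing set holds at most a chosen number of All-Stars. It uses a loop so that the accepted mask is always the one that gets used. The function name `split_mask`, its parameters, and the toy label counts are assumptions made for illustration.

```python
import random


def split_mask(flags, train_fraction=0.8, max_test_positives=8, rng=random):
    """Draw a random train/test mask, re-drawing until the testing side
    holds at most max_test_positives positive labels."""
    while True:
        # True -> row goes to training, False -> row goes to testing
        mask = [rng.random() < train_fraction for _ in flags]
        test_positives = sum(
            1 for flag, in_train in zip(flags, mask) if flag and not in_train
        )
        if test_positives <= max_test_positives:
            return mask


# Hypothetical labels: 28 All-Stars among 540 players
rng = random.Random(0)
toy_flags = [True] * 28 + [False] * 512
mask = split_mask(toy_flags, rng=rng)
test_positive_count = sum(
    1 for flag, in_train in zip(toy_flags, mask) if flag and not in_train
)
```

Passing a seeded `random.Random` instance makes a particular split repeatable, which helps when comparing feature sets on identical splits.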

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
#Declares an incorrect prediction as False Positive, if a non-Allstar is marked as an All-Star,
#or Missed AllStar, if an Allstar is classified as a non-Allstar
def resType(pred,data):
    if pred==True and data==False:
        return " - False Positive"
    else :
        return " - Missed AllStar"
#Allows the function sorted() to sort a list of tuples by 2nd item of the tuple
def getKey(item):
    return item[1]

In [14]:
#@param df is the data with duplicate names removed, and the appropriate columns dropped (player names, etc.)
#@param original_df is the data with duplicate names removed as returned from prepareDF
#@param displayOutcome determines whether or not the function should print out descriptions of how the model performed 
def classify(df, original_df=originalData, displayOutcome=True):
    #Split ASFlag according to the 80/20 split
    msk = np.random.rand(len(df)) < 0.8
    testing_flags = [ASFlag[i] for i in xrange(len(ASFlag)) if not msk[i]]
    training_flags = [ASFlag[i] for i in xrange(len(ASFlag)) if msk[i]]
    numAllStars = sum(testing_flags)
    #Re-draw the split if more than 7 of the 28 All-Stars landed in the testing set
    if numAllStars>7:
        return classify(df,original_df,displayOutcome)
    #Split data into training and testing sets according to the 80/20 split
    train = df[msk]
    test = df[~msk]
    clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(train, training_flags)      
    predictions = list(clf.predict(test))
    numCorrect=0
    #list of wrong tuples, in order of predicted then actual
    wrong=[]
    correctAS=[]
    for pred,data,i in zip(predictions,testing_flags,xrange(len(predictions))):
        if pred==data:
            numCorrect+=1
            if pred==True:
                correctAS.append(i)
        else:
            wrong.append((i,pred,data))
    if displayOutcome==True:
        print "Number of Correct Predictions: ", numCorrect
        print "Total Predictions Made: ", len(predictions)
        print numCorrect*1.0/len(predictions), "\n"

        print "Correctly Predicted Allstars:\n"
        for indx in correctAS:
            ogIndex=test.index[indx]
            print original_df.iloc[ogIndex,0]
        print "\nWrong Predictions: \n"
        for i,pred,data in wrong:
            ogIndex=test.index[i]
            print original_df.iloc[ogIndex,0] + resType(pred,data)
    diagnostics=(numCorrect,len(predictions),numAllStars,len(correctAS),clf)
    return diagnostics

In [15]:
classify(finalData,originalData,displayOutcome=True)

Number of Correct Predictions:  112
Total Predictions Made:  115
0.973913043478 

Correctly Predicted Allstars:

Giannis Antetokounmpo
Bradley Beal
Stephen Curry
Karl-Anthony Towns
John Wall

Wrong Predictions: 

Devin Booker - False Positive
Kevin Love - Missed AllStar
Ben Simmons - False Positive


(112,
 115,
 6,
 5,
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='multinomial',
           n_jobs=1, penalty='l2', random_state=0, solver='lbfgs',
           tol=0.0001, verbose=0, warm_start=False))

In [16]:
#Test classifier by running it many times, and examining its accuracy
def testClassifier1(its,df,ogdf=originalData):
    correct=0
    total=0
    correctAllStars=0
    totAllStars=0

    for i in xrange(its):
        clf=classify(df,ogdf,displayOutcome=False)
        correct+=clf[0]
        total+=clf[1]
        totAllStars+=clf[2]
        correctAllStars+=clf[3]
    totAcc = 1.0*correct/total
    asAcc = 1.0*correctAllStars/totAllStars
    
    print "Over ", its, " iterations, the classifier predicted ",correct, "outcomes correctly out of ", total, "total data points for an overall accuracy of ", totAcc ,"."
    print "Of the ", totAllStars, " All-Stars that were classified, ", correctAllStars, " were correctly identified, for an accuracy of ", asAcc, "." 
testClassifier1(1000, finalData)

Over  1000  iterations, the classifier predicted  104927 outcomes correctly out of  108390 total data points for an overall accuracy of  0.96805055817 .
Of the  5579  All-Stars that were classified,  3494  were correctly identified, for an accuracy of  0.626277110593 .


### Making A More Relevant Model
During the Logistic Regression, likelihood values between 0 and 1 are assigned to each player of the testing data, and a likelihood greater than 0.5 is classified as an All-Star. However, if we know there are 6 All-Stars in the testing data, it would be more relevant to find the 6 players most likely to be all-stars.

 - Adjust classifying function to return a more relevant result
 - Previously we saw an extremely high overall accuracy
     - It must be relatively easy for the classifier to determine that most players are not All-Stars
     - It is more informative to focus on the All-Star accuracy alone (the share of actual All-Stars the model finds), which ignores non-All-Stars who are correctly classified
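
The difference between thresholding at 0.5 and taking the most likely players can be illustrated with a small sketch. The helper `top_k_predictions` and the player names and likelihoods below are made up for illustration.

```python
def top_k_predictions(names, probabilities, k):
    """Return the k names with the highest predicted likelihood."""
    ranked = sorted(zip(probabilities, names), reverse=True)
    return [name for _, name in ranked[:k]]


# Hypothetical likelihoods for five players, three of whom are All-Stars
names = ["Player A", "Player B", "Player C", "Player D", "Player E"]
probabilities = [0.97, 0.46, 0.88, 0.12, 0.41]
actual_all_stars = {"Player A", "Player B", "Player C"}

# Thresholding at 0.5 only finds two of the three All-Stars
thresholded = [n for n, p in zip(names, probabilities) if p > 0.5]

# Knowing that three All-Stars are present, take the three most likely players
top_three = top_k_predictions(names, probabilities, k=3)
found = [n for n in top_three if n in actual_all_stars]
```

In this toy case the threshold misses Player B at 0.46, while the top-three ranking recovers all three All-Stars.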

In [17]:
#@param df is the data with duplicate names removed, and the appropriate columns dropped (player names, etc.)
#@param original_df is the data with duplicate names removed as returned from prepareDF
#@param testing determines whether or not the function should print outputs, and return certain variables
def classifyTopX(df, original_df=originalData,testing=False):
    msk = np.random.rand(len(df)) < 0.8
    testing_flags = [ASFlag[i] for i in xrange(len(ASFlag)) if not msk[i]]
    numAllStars = sum(testing_flags) #number of allstars we want (we'll find the best numAllStars)
    #training set should have at least 20 of the 28 all stars in it
    if numAllStars>8:
        return 0,0,0,""
    else:
        if not testing:
            print "We need to predict " + str(numAllStars) + " more All_Stars."
        train = df[msk]
        training_flags = [ASFlag[i] for i in xrange(len(ASFlag)) if msk[i]]
        test = df[~msk]
        clf2 = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(train, training_flags)  
        realAllStars=[original_df.iloc[test.index[i],0] for i in xrange(len(test)) if original_df.iloc[test.index[i],0] in all_stars]
        if not testing:
            print "Correct Answers are:"
            for each in realAllStars:
                print each
        #Determine likelihoods (between 0 and 1) for each player
        likelihoods=clf2.predict_proba(test)
        #Sort likelihoods, while keeping the index, which we can use to fetch the player's name
        posLikelihood=[(likelihoods[i][1],i) for i in xrange(len(likelihoods))]
        top=sorted(posLikelihood,reverse=True)
        #compile the likelihoods with player names into a data frame
        t2=[list((original_df.iloc[test.index[indx],0],likelihood)) for likelihood,indx in top]  
        #Also include the 5 runners-up so we can see how close the ranking is and whether the runners-up are understandable
        resultsDF=pd.DataFrame(t2,columns=["Name","Likelihood"]).head(numAllStars+5)      
        predictions = [t2[i][0] for i in xrange(numAllStars)]
        numCorrect=0
        correctAS=[]
        wrong=[]
        #wrong collects the names we predicted as All-Stars who were not All-Stars
        for i in xrange(numAllStars):
            if predictions[i] in realAllStars:
                numCorrect+=1
                correctAS.append(predictions[i])
            else:
                wrong.append(predictions[i])

        if not testing:
            print "We predicted " +str(numCorrect) +" correctly out of " + str(numAllStars)+"."
            print "Our "+ str(numAllStars) + " predictions, with the next 5 best predictions are outputted below:"
    if testing==True:
        return [resultsDF,numCorrect,numAllStars,wrong,clf2]
    #return the classifier so we can use it to predict next season if desired
    return  resultsDF,clf2


In [18]:
clf2=classifyTopX(finalData, testing=False)
clf2[0]

We need to predict 6 more All_Stars.
Correct Answers are:
DeMar DeRozan
LeBron James
Damian Lillard
Victor Oladipo
John Wall
Russell Westbrook
We predicted 4 correctly out of 6.
Our 6 predictions, with the next 5 best predictions are outputted below:


| | Name | Likelihood |
|---:|:---|---:|
| 0 | LeBron James | 0.991962 |
| 1 | Damian Lillard | 0.952726 |
| 2 | Russell Westbrook | 0.947232 |
| 3 | Nikola Jokic | 0.891141 |
| 4 | Chris Paul | 0.854222 |
| 5 | John Wall | 0.632471 |
| 6 | Victor Oladipo | 0.608919 |
| 7 | Ben Simmons | 0.51973 |
| 8 | Marc Gasol | 0.490622 |
| 9 | Lou Williams | 0.46869 |


In [19]:
#to allow sorting by number of false allstars
def sum_repeats(l):
    #an empty list simply produces an empty tally, so the result can always be sorted
    d=[]
    for entry in l:
        try:
            index=[each[0]for each in d].index(entry)
            d[index][1]+=1
        except ValueError:
            d.append([entry,1])
    return d

In [20]:
def testClassifier2(its,df,ogdf=originalData,output=True):
    correct=0
    total=0
    compile_wrongs=[]
    for i in xrange(its):
        clf=classifyTopX(df,ogdf,testing=True)
        correct+=clf[1]
        total+=clf[2]
        if len(clf[3])>0:
            for each in clf[3]:
                compile_wrongs.append(each)
    totAcc = 1.0*correct/total
    
    print "Over " + str(its) + " iterations, the classifier predicted " + str(correct) + " All-Stars correctly out of " +str(total) + " total attempts." 
    print "The classifier had an overall accuracy of " + str(totAcc) + "."
    if output==True:
        return compile_wrongs
sorted(sum_repeats(testClassifier2(1000, finalData)),key=getKey,reverse=True)

Over 1000 iterations, the classifier predicted 3240 All-Stars correctly out of 4681 total attempts.
The classifier had an overall accuracy of 0.692159794916.


[[u'Blake Griffin', 190],
 [u'Nikola Jokic', 162],
 [u'Devin Booker', 158],
 [u'Chris Paul', 156],
 [u'MarShon Brooks', 125],
 [u'Lou Williams', 113],
 [u'Marc Gasol', 85],
 [u'Ben Simmons', 70],
 [u'Tyreke Evans', 69],
 [u'Jrue Holiday', 49],
 [u'Aaron Gordon', 37],
 [u'Rudy Gobert', 34],
 [u'Nikola Vucevic', 32],
 [u'CJ McCollum', 31],
 [u'Dwight Howard', 31],
 [u'Clint Capela', 12],
 [u'Nikola Mirotic', 12],
 [u'Tobias Harris', 12],
 [u'Harrison Barnes', 8],
 [u'Khris Middleton', 8],
 [u'DeAndre Jordan', 7],
 [u'Eric Bledsoe', 6],
 [u'Kawhi Leonard', 5],
 [u'Hassan Whiteside', 5],
 [u'Eric Gordon', 4],
 [u'Mike Conley', 3],
 [u'Paul Millsap', 2],
 [u'T.J. Warren', 2],
 [u'Dennis Schroder', 2],
 [u'Andre Ingram', 2],
 [u'Donovan Mitchell', 2],
 [u'Lauri Markkanen', 1],
 [u'Kyle Kuzma', 1],
 [u'Carmelo Anthony', 1],
 [u'J.J. Redick', 1],
 [u'Lonzo Ball', 1],
 [u'Otto Porter', 1],
 [u'Jeremy Lin', 1]]
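
As an aside, the same tally of false positives can be built with `collections.Counter` from the standard library. `most_common()` returns the entries already sorted by count. The wrapper name `tally_false_positives` and the sample names are illustrative assumptions.

```python
from collections import Counter


def tally_false_positives(names):
    """Count how often each name appears, most frequent first."""
    return Counter(names).most_common()


tally = tally_false_positives(
    ["Player A", "Player B", "Player A", "Player C", "Player A", "Player B"]
)
# tally == [("Player A", 3), ("Player B", 2), ("Player C", 1)]
```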

### Common False Positives:
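| Player | PTS | REB | AST |
|:----------------:|:----------:|:----------:|:----------:|
| Nikola Jokic | 18.5 | 10.7 | 6.1 |
| Chris Paul | 18.6 | 5.4 | 7.9 |
| Marc Gasol | 17.3 | 8.1 | 4.2 |
| Ben Simmons | 15.8 | 8.1 | 8.2 |
| Devin Booker | 24.9 | 4.5 | 4.7 |
| Lou Williams | 22.6 | 2.5 | 5.3 |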


### Performance

 - The first classifier identifies All-Stars correctly roughly 63% of the time. When it knows how many All-Stars it is looking for, this increases to around 69%.
 - The common false positives that repeatedly showed up definitely had All-Star numbers, and all had strong cases to make the team last year, with the possible exceptions of MarShon Brooks and Tyreke Evans.
 - Overall, the classifier is working as intended.
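
The "All-Star accuracy" used here is what is usually called recall: the share of actual All-Stars that the model finds. As a worked example, the counts below are taken from the single `classify` run shown earlier (112 of 115 predictions correct, 5 of 6 All-Stars found, 2 false positives). The function name `classification_metrics` is an assumption for illustration.

```python
def classification_metrics(true_pos, false_pos, false_neg, total):
    """Return (accuracy, recall, precision) from confusion-matrix counts."""
    true_neg = total - true_pos - false_pos - false_neg
    accuracy = float(true_pos + true_neg) / total
    recall = float(true_pos) / (true_pos + false_neg)
    precision = float(true_pos) / (true_pos + false_pos)
    return accuracy, recall, precision


# 5 All-Stars found, 2 false positives, 1 missed All-Star, 115 test players
accuracy, recall, precision = classification_metrics(5, 2, 1, 115)
```

For that run the overall accuracy is about 0.974, while recall is about 0.833 and precision about 0.714. This shows how the large number of easy non-All-Stars inflates the overall figure.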


### Potential Adjustments

We are now ready to use our model to predict the All-Stars for the 2018-2019 season. First, however, we should check whether there is room to improve the model's accuracy.
 - Try different combinations of parameters. Previously we dropped Minutes Played, Offensive Rebounds, Defensive Rebounds, Age, Games, and Games Started from consideration
 
### Investigation Into Games as a Viable Predictor

In [17]:
#Adjustment  1 -- Keeping games and games started -- ~72% Accuracy
dataAdjustment1=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos'],axis=1)  
testClassifier2(1000, dataAdjustment1,output=False)

Over 1000 iterations, the classifier predicted 3376 All-Stars correctly out of 4697 total attempts.
The classifier had an overall accuracy of 0.718756653183.


In [18]:
#Adjustment  2 -- Replacing Games and Games Started with startedPercent -- ~71% Accuracy
gs=[int(i) for i in originalData['GS']]
g=[int(i) for i in originalData['G']]

startPercent=[gs[i]*1.0/g[i] for i in xrange(len(g))]
originalData['startPercent']=startPercent
dataAdjustment2=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','G','GS'],axis=1)  
testClassifier2(1000, dataAdjustment2,output=False)

#Reset originalData to its initial state (without startPercent)
originalData=prepareDF(df)

Over 1000 iterations, the classifier predicted 3365 All-Stars correctly out of 4736 total attempts.
The classifier had an overall accuracy of 0.710515202703.


In [19]:
#Adjustment  3 --Just Games Started -- ~72.5% Accuracy

dataAdjustment3=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','G',],axis=1)  
testClassifier2(1000, dataAdjustment3,output=False)

Over 1000 iterations, the classifier predicted 3355 All-Stars correctly out of 4629 total attempts.
The classifier had an overall accuracy of 0.724778569886.


In [21]:
#Adjustment  4 --Just Games -- ~69% Accuracy

dataAdjustment4=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','GS',],axis=1)  
testClassifier2(1000, dataAdjustment4,output=False)

Over 1000 iterations, the classifier predicted 3221 All-Stars correctly out of 4666 total attempts.
The classifier had an overall accuracy of 0.690312901843.


### Preliminary Results

The difference between keeping both Games and Games Started and using only Games Started is well within run-to-run noise. We will proceed with just Games Started to simplify our model.

| Model | Accuracy |
|:----------------:|:----------:|
| Games | 69% |
| Neither | 69% |
| startPercent | 71% |
| Games and Games Started | 72% |
| Games Started | 72.5% |
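
To get a rough sense of how large a gap between two rows of this table would need to be, one can treat each accuracy as a binomial proportion and compare the difference with its standard error. This is only a sketch: it assumes the attempts are independent, which is not strictly true because the same players recur across the 1000 random splits. The counts are the ones printed above for "Games Started" and "Games and Games Started", and the helper name `proportion_and_se` is an assumption.

```python
import math


def proportion_and_se(correct, attempts):
    """Return the observed proportion and its binomial standard error."""
    p = float(correct) / attempts
    return p, math.sqrt(p * (1.0 - p) / attempts)


p_gs, se_gs = proportion_and_se(3355, 4629)      # Games Started only
p_both, se_both = proportion_and_se(3376, 4697)  # Games and Games Started

# z-score of the difference between the two proportions
diff = p_gs - p_both
z = diff / math.sqrt(se_gs ** 2 + se_both ** 2)
```

Each standard error is about 0.0066, or roughly 0.7 percentage points. The 0.6-point gap between the two models gives a z-score below 1, well short of the usual 1.96 cutoff.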

### 2-Point Field Goal% vs Attempts and Made


In [22]:
#Adjustment  5 -- Removing 2P% -- 72.4%

dataAdjustment5=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','2P%','G'],axis=1)  
testClassifier2(1000, dataAdjustment5,output=False)

Over 1000 iterations, the classifier predicted 3493 All-Stars correctly out of 4826 total attempts.
The classifier had an overall accuracy of 0.723787815997.


In [23]:
#Adjustment  6 -- Removing 2PA and 2P --71.5%

dataAdjustment6=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','2PA','2P','G'],axis=1)  
testClassifier2(1000, dataAdjustment6,output=False)

Over 1000 iterations, the classifier predicted 3440 All-Stars correctly out of 4811 total attempts.
The classifier had an overall accuracy of 0.715028060694.


In [24]:
#Adjustment  7 -- Only Keeping 2P (2 pointers made) -- 71.3%

dataAdjustment7=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','2PA','2P%','G'],axis=1)  
testClassifier2(1000, dataAdjustment7,output=False)

Over 1000 iterations, the classifier predicted 3388 All-Stars correctly out of 4749 total attempts.
The classifier had an overall accuracy of 0.713413350179.


In [25]:
#Adjustment  7_b -- no 2P data -- 71.6%

dataAdjustment7_b=originalData.drop(['Player','MP','Tm', 'ORB','DRB','Age','Pos','2PA','2P','2P%','G'],axis=1)  
testClassifier2(1000, dataAdjustment7_b,output=False)

Over 1000 iterations, the classifier predicted 3371 All-Stars correctly out of 4706 total attempts.
The classifier had an overall accuracy of 0.71631959201.


### Takeaways

 - Removing 2P% has no significant effect on the model (72.4% versus the 72.5% baseline), as expected
 - Removing 2PA and 2P (71.5%), removing 2PA and 2P% (71.3%), and completely ignoring 2-pointer data (71.6%) all produce similar results
 - Removing 2P% alone is the only variant that doesn't see a decrease in accuracy
 
We will proceed with Adjustment 5, as it simplifies our model while maintaining a similar level of accuracy.

### Repeat this process with Free Throws, and 3-Pointers, comparing to a baseline of ~72.5%

In [32]:
#Adjustments 8-10 -- Free Throws
print "Removing FT%"
dataAdjustment8=originalData.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','FT%' ,'G'],axis=1)  
testClassifier2(1000, dataAdjustment8,output=False)

print "\nRemoving FT, FTA"
dataAdjustment9=originalData.drop(['Player','Tm', 'MP','ORB','DRB','Age','Pos','2P%','FT','FTA','G'],axis=1)  
testClassifier2(1000, dataAdjustment9,output=False)

print "\nRemoving FT, FTA, FT%'"
dataAdjustment10=originalData.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','FT','FTA','FT%','G'],axis=1)  
testClassifier2(1000, dataAdjustment10,output=False)

Removing FT%
Over 1000 iterations, the classifier predicted 3316 All-Stars correctly out of 4713 total attempts.
The classifier had an overall accuracy of 0.703585826438.

Removing FT, FTA
Over 1000 iterations, the classifier predicted 3375 All-Stars correctly out of 4711 total attempts.
The classifier had an overall accuracy of 0.716408405859.

Removing FT, FTA, FT%'
Over 1000 iterations, the classifier predicted 3540 All-Stars correctly out of 4825 total attempts.
The classifier had an overall accuracy of 0.733678756477.


### Takeaways
- Removing all free-throw stats improves our model to roughly 73.4% accuracy
- Moving forward we will ignore all free-throw stats. A simpler model is preferable to a more complicated one

In [34]:
#Adjustments 11-13 -- 3-Pointers
print "Removing 3P% 3PA"
dataAdjustment11=originalData.drop(['Player','Tm', 'MP','ORB','DRB','Age','Pos','2P%','FT','FTA','FT%','3P%', '3PA','G'],axis=1)  
testClassifier2(1000, dataAdjustment11,output=False)

print "\nRemoving 3P, 3PA"
dataAdjustment12=originalData.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','FT','FTA','FT%','3P','3PA','G'],axis=1)  
testClassifier2(1000, dataAdjustment12,output=False)

print "\nRemoving 3P, 3PA, 3P%"
dataAdjustment13=originalData.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','FT','FTA','FT%','3P','3PA','3P%','G'],axis=1)  
testClassifier2(1000, dataAdjustment13,output=False)



Removing 3P% 3PA
Over 1000 iterations, the classifier predicted 3445 All-Stars correctly out of 4718 total attempts.
The classifier had an overall accuracy of 0.730182280627.

Removing 3P, 3PA
Over 1000 iterations, the classifier predicted 3361 All-Stars correctly out of 4687 total attempts.
The classifier had an overall accuracy of 0.717089822914.

Removing 3P, 3PA, 3P%
Over 1000 iterations, the classifier predicted 3427 All-Stars correctly out of 4675 total attempts.
The classifier had an overall accuracy of 0.733048128342.


 - Again, the 3-point statistics do not seem to help our classifier much (73.3% with all of them removed), so we can ignore them completely.
 
### Minutes Played and Personal Fouls

In [36]:
print "Keeping MP"
dataAdjustment14=originalData.drop(['Player','Tm', 'ORB','DRB','Age','Pos','2P%','FT','FTA','FT%','3P','3PA','3P%','G'],axis=1)  
testClassifier2(1000, dataAdjustment14,output=False)

Keeping MP
Over 1000 iterations, the classifier predicted 3394 All-Stars correctly out of 4686 total attempts.
The classifier had an overall accuracy of 0.724285104567.


 - This model showed no improvement, so we will continue ignoring minutes played

In [37]:
print "Removing Fouls"
#Note: 'PF' is listed twice and 'FT%' is missing, so unlike dataAdjustment13 this run keeps FT%
dataAdjustment15=originalData.drop(['Player','Tm', 'MP','ORB','DRB','Age','Pos','2P%','FT','FTA','PF','3P','3PA','3P%','PF','G'],axis=1)  
testClassifier2(1000, dataAdjustment15,output=False)

Removing Fouls
Over 1000 iterations, the classifier predicted 3456 All-Stars correctly out of 4743 total attempts.
The classifier had an overall accuracy of 0.728652751423.


 - This model is quite good (72.9%), but slightly less accurate than the 73.3% we have seen. Because its drop list keeps FT%, it is also not a like-for-like comparison with dataAdjustment13

In [38]:
print "Removing eFG%\n"
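#Note: compared with dataAdjustment13, this list also drops '2PA' and 'PF' and keeps 'FT%', so eFG% is not the only change
dataAdjustment16=originalData.drop(['Player','Tm', 'MP','ORB','DRB','Age','Pos','2P%','2PA','FT','FTA','3P','3PA','3P%','PF','eFG%','G'],axis=1)  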
testClassifier2(1000, dataAdjustment16,output=False)

Removing eFG%

Over 1000 iterations, the classifier predicted 3489 All-Stars correctly out of 4807 total attempts.
The classifier had an overall accuracy of 0.725816517579.


 - Removing effective field goal percentage did not improve our model (72.6%), although this run also differs from dataAdjustment13 in its handling of 2PA, PF, and FT%.
 - There are hundreds of possible combinations of parameters we could check, but these seemed like the most reasonable to test and compare
 - We will proceed by using dataAdjustment13 as the data to train our classifier with
 

## Predicting the 2018-2019 All-Stars
 - Gather data for the 2018-2019 season (through the end of the day on 1/27/2019)
 - We will now use all of the data from 2017-2018 to train the classifier instead of splitting it 80/20
 - Insert the 2018-2019 season data as the testing data, and predict the All-Star teams for this year

In [39]:
#get 2018-2019 data 
data2019=gather_data('https://www.basketball-reference.com/leagues/NBA_2019_per_game.html')

In [40]:
og2019=prepareDF(data2019)

In [42]:
cleaned_2019=og2019.drop(['Player','Tm','MP', 'ORB','DRB','Age','Pos','2P%','FT','FTA','FT%','3P','3PA','3P%','G'],axis=1)

In [45]:
clf_final = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(dataAdjustment13, ASFlag)  
probs=clf_final.predict_proba(cleaned_2019)
posLikelihood_2019=[(probs[i][1],i) for i in xrange(len(probs))]
sortedLikelihoods=sorted(posLikelihood_2019,reverse=True)
playerWeights = [list((og2019.iloc[cleaned_2019.index[indx],0],likelihood)) for likelihood,indx in sortedLikelihoods]
df2019=pd.DataFrame(playerWeights, columns=["Name","Likelihood"])
df2019.head(24)

| | Name | Likelihood |
|---:|:---|---:|
| 0 | James Harden | 0.986255 |
| 1 | Anthony Davis | 0.984148 |
| 2 | Kevin Durant | 0.930875 |
| 3 | Stephen Curry | 0.901792 |
| 4 | LeBron James | 0.896558 |
| 5 | Joel Embiid | 0.888503 |
| 6 | Damian Lillard | 0.861167 |
| 7 | Kawhi Leonard | 0.8169 |
| 8 | Paul George | 0.778828 |
| 9 | Kemba Walker | 0.778414 |


## Results

Overall, the 24 predicted All-Stars are all great players, and I could absolutely see this being the lineup.

Some notable snubs are Ben Simmons, LaMarcus Aldridge, and Jimmy Butler. It's also strange that Russell Westbrook is so low, considering that he is averaging a triple-double.

### Future Work
 1. Tweaking the Model
     - Using a different combination of parameters, including advanced stats such as Player Efficiency Rating (PER), or Win Shares
     - Creating more of our own statistics
          - For example, I created a statistic for games started percentage by dividing games started by total games, but the model became less accurate.
 2. Exploring other ways to make a classifier
     - Support Vector Machines
     - Random Forest
     - XGBoost
     - Cross-Validation with 99/1 train/test splits
 3. Using more relevant data
     - Creating a separate model for guards and the frontcourt
     - Creating a separate model for East vs. West
     - Only using data from the 2017-2018 season before the All-Star Game
 4. Addressing Limitations
     - It is entirely possible that, although we removed free throw and 3 pointer statistics, there is some obscure combination of parameters which could produce a better model
     - Each time I tested the accuracy, there were fluctuations of +/- 1%
         - I tried to address this by running it 1,000 times, but the fluctuations still occur
         - Increasing that number would make the run time of the notebook unreasonable, but in the future could be pursued