# Team Score Prediction

- `import numpy as np`: Numerical computations library.
- `import pandas as pd`: Data manipulation and analysis library.
- `import os`: Interaction with the operating system.
- `import seaborn as sns`: Statistical data visualization library.
- `from datetime import datetime`: Date and time manipulation.
- `from sklearn.preprocessing import LabelEncoder`: Encoding categorical variables.
- `from sklearn.linear_model import LinearRegression, LogisticRegression`: Linear and logistic regression models.
- `from sklearn.metrics import ...`: Evaluation metrics (accuracy, confusion matrix, precision, recall, etc.).
- `from sklearn.pipeline import Pipeline`: Data processing pipeline.
- `import matplotlib.pyplot as plt`: Data visualization.
- `from sklearn.model_selection import GridSearchCV`: Hyperparameter tuning using grid search.
- `from sklearn.preprocessing import StandardScaler, MinMaxScaler`: Feature scaling.
- `from sklearn.ensemble import RandomForestRegressor`: Random Forest Regressor model.
- `from sklearn.tree import DecisionTreeRegressor`: Decision Tree Regressor model.
- `from sklearn.linear_model import Ridge, Lasso`: Ridge and Lasso regression models.
- `from keras.losses import mean_squared_error`: Loss function for Keras.
- `import tensorflow as tf`: TensorFlow library for machine learning.
- `from tensorflow import keras`: High-level neural networks API.
- `import pickle`: Serialization and deserialization of Python objects.
- `import requests`: Making HTTP requests.
- `from bs4 import BeautifulSoup`: Web scraping library.
- `from sklearn.compose import ColumnTransformer`: Transforming columns in a dataset.
- `from sklearn.preprocessing import OneHotEncoder`: One-hot encoding categorical features.
- `from tensorflow.keras.models import Sequential`: Building sequential models in Keras.
- `from tensorflow.keras.layers import LSTM, Dense`: LSTM and Dense layers for neural networks.
- `import warnings`: Handling warnings.
- `warnings.filterwarnings("ignore")`: Suppressing warnings.


In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso
from keras.losses import mean_squared_error
from sklearn.metrics import r2_score
import tensorflow as tf
from tensorflow import keras
import pickle
from sklearn import metrics
from sklearn.model_selection import train_test_split
import requests
from bs4 import BeautifulSoup
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import warnings
warnings.filterwarnings("ignore")

In [None]:
np.random.seed(0)

In [None]:
pd.set_option('display.max_columns', None)

The data comes from the Kaggle dataset "ICC Men's World Cup 2023 Dataset". Its `deliveries.csv` file records every ball bowled in every match of the tournament. This notebook uses it to predict the total a team is likely to reach while batting.

In [None]:
ball=pd.read_csv("/kaggle/input/icc-mens-world-cup-2023/deliveries.csv")

In [None]:
ball

- **Filtering Rows with Non-null 'penalty':**
  Filters the DataFrame `ball` to display rows where the 'penalty' column is not null. This operation identifies and shows rows containing data related to penalties.

- **Unique Values in 'season' Column:**
  Retrieves the unique values present in the 'season' column of the DataFrame `ball`. It returns a list of distinct season values in the dataset.

- **Unique Dates in 'start_date' Column:**
  Displays the unique dates found in the 'start_date' column of the DataFrame `ball`. This operation shows a list of distinct dates present in the dataset.

- **Unique Values in 'wides' Column:**
  Lists the unique values present in the 'wides' column of the DataFrame `ball`. The values are the runs conceded as wides on a delivery rather than categories, and deliveries without a wide are missing (`NaN`).

- **Fill Missing Values with Zeros:**
  Fills missing values with zeros in specific columns ('wides', 'noballs', 'byes', 'legbyes') of the DataFrame `ball`. This ensures these columns contain numerical data and replaces any missing values with zeros.

- **Unique Values in 'wicket_type' Column:**
  Retrieves the unique values found in the 'wicket_type' column of the DataFrame `ball`. This operation displays various types of dismissals or wicket events that occurred during matches.


In [None]:
ball[ball['penalty'].notnull()]

In [None]:
ball['season'].unique()

In [None]:
ball['start_date'].unique()

In [None]:
ball['wides'].unique()

In [None]:
ball[['wides','noballs','byes' ,'legbyes']]=ball[['wides','noballs','byes' ,'legbyes']].fillna(0)

In [None]:
ball

In [None]:
ball['wicket_type'].unique()

- **Creating a New Column 'wicket':**
  Adds a 'wicket' column to the DataFrame `ball` that flags whether a wicket fell on each delivery, derived from the 'wicket_type' column.

- **Lambda Function Application:**
  `apply()` runs a lambda function over every value in 'wicket_type'. `pd.isna(x)` is true for missing values, so a delivery with a recorded dismissal type gets 1 in the 'wicket' column and every other delivery gets 0.
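
As an illustration of the flag described above, here is a minimal sketch on a toy frame. The `demo` DataFrame and its values are made up for the example. It uses the vectorized `notna()` call, which should give the same 0/1 result as the lambda without a Python-level loop.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'wicket_type' column (hypothetical values)
demo = pd.DataFrame({"wicket_type": [np.nan, "caught", np.nan, "bowled"]})

# notna() is True wherever a dismissal type is recorded; cast to 0/1
demo["wicket"] = demo["wicket_type"].notna().astype(int)

print(demo)
```

On the full dataset the same expression would be applied to `ball['wicket_type']`.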

In [None]:
ball['wicket'] = ball['wicket_type'].apply(lambda x: 1 if not pd.isna(x) else 0)

In [None]:
ball

In [None]:
ball.drop(['penalty' , 'wicket_type' , 'player_dismissed' , 'other_wicket_type' , 'other_player_dismissed','season'], axis=1, inplace=True)

- **Grouping Data and Calculating Sums:**
  - `result = ball.groupby(['match_id', 'innings'])[['runs_off_bat', 'extras', 'wides', 'noballs', 'byes', 'legbyes']].sum()`
    - Groups the data in `ball` by 'match_id' and 'innings'.
    - Calculates the sum of numerical columns 'runs_off_bat', 'extras', 'wides', 'noballs', 'byes', 'legbyes' for each group.
    - Stores the aggregated results in the DataFrame `result`.

- **Resetting Index:**
  - `result = result.reset_index()`
    - Resets the index of the DataFrame `result` after the groupby operation.

- **Dropping Unnecessary Columns:**
  - `result.drop(['noballs' , 'byes' , 'legbyes','wides'], axis=1, inplace=True)`
    - Removes columns 'noballs', 'byes', 'legbyes', 'wides' from the DataFrame `result`.

- **Calculating Total Runs:**
  - `result['total'] = result['runs_off_bat'] + result['extras']`
    - Creates a new column 'total' in `result` containing the sum of 'runs_off_bat' and 'extras'.

- **Dropping Individual Runs Columns:**
  - `result.drop(['runs_off_bat' , 'extras'], axis=1, inplace=True)`
    - Removes columns 'runs_off_bat' and 'extras' from the DataFrame `result`.

In [None]:
result = ball.groupby(['match_id', 'innings'])[['runs_off_bat', 'extras', 'wides', 'noballs', 'byes', 'legbyes']].sum()

result = result.reset_index()
result.drop(['noballs' , 'byes' , 'legbyes','wides'], axis=1, inplace=True)
result['total'] = result['runs_off_bat'] + result['extras']
result.drop(['runs_off_bat' , 'extras'], axis=1, inplace=True)
result.head(10)

- Merges the DataFrame `ball` with the DataFrame `result` on the shared key columns 'match_id' and 'innings'.
- Every delivery row thereby gains the 'total' column, which holds the final score of the innings it belongs to.
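
The aggregation and merge can be sketched on a small, made-up frame. The `demo` data and the `keys`, `totals`, and `merged` names are assumptions for illustration only; the column names follow the dataset.

```python
import pandas as pd

# Toy deliveries: two innings in match 1, one innings in match 2 (hypothetical values)
demo = pd.DataFrame({
    "match_id":     [1, 1, 1, 1, 2, 2],
    "innings":      [1, 1, 2, 2, 1, 1],
    "runs_off_bat": [4, 0, 1, 6, 2, 0],
    "extras":       [0, 1, 0, 0, 0, 1],
})
keys = ["match_id", "innings"]

# Sum runs per innings, then keep only the keys and the innings total
totals = demo.groupby(keys, as_index=False)[["runs_off_bat", "extras"]].sum()
totals["total"] = totals["runs_off_bat"] + totals["extras"]
totals = totals[keys + ["total"]]

# Attach the innings total to every delivery of that innings
merged = demo.merge(totals, on=keys)
print(merged)
```

Each delivery of an innings carries the same `total`, which is what lets it serve as the regression target later on.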

In [None]:
ball = ball.merge(result, on=['match_id', 'innings'])
ball

- **Dropping Specific Columns:**
  - `ball.drop(['noballs', 'byes', 'legbyes', 'wides'], axis=1, inplace=True)`
    - Removes columns 'noballs', 'byes', 'legbyes', and 'wides' from the DataFrame `ball`.

- **Calculating Cumulative Runs and Wickets per Match and Innings:**
  - `ball['cumulative_runs'] = ball.groupby(['match_id', 'innings'])['runs_off_bat'].cumsum() + ball.groupby(['match_id', 'innings'])['extras'].cumsum()`
    - Computes the cumulative sum of 'runs_off_bat' and 'extras' columns within each group defined by 'match_id' and 'innings', storing the result in the 'cumulative_runs' column.
  - `ball['wickets'] = ball.groupby(['match_id', 'innings'])['wicket'].cumsum()`
    - Calculates the cumulative sum of 'wicket' column within each group defined by 'match_id' and 'innings', storing the result in the 'wickets' column.

- **Dropping Additional Columns:**
  - `ball.drop(['striker', 'non_striker', 'bowler', 'match_id', 'runs_off_bat', 'extras', 'innings', 'wicket', 'start_date'], axis=1, inplace=True)`
    - Removes columns 'striker', 'non_striker', 'bowler', 'match_id', 'runs_off_bat', 'extras', 'innings', 'wicket', and 'start_date' from the DataFrame `ball`.

Together, these steps leave `ball` with the running score and the wickets fallen at each delivery, and remove the columns that are no longer needed.
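
A minimal sketch of the running totals on made-up data follows. The `demo` frame and the helper column `ball_runs` are assumptions for the example. Summing the two run columns first and taking one cumulative sum should be equivalent to adding two separate cumulative sums.

```python
import pandas as pd

# Toy deliveries for two innings (hypothetical values)
demo = pd.DataFrame({
    "match_id":     [1, 1, 1, 2, 2],
    "innings":      [1, 1, 1, 1, 1],
    "runs_off_bat": [1, 4, 0, 6, 0],
    "extras":       [0, 0, 1, 0, 0],
    "wicket":       [0, 0, 1, 0, 1],
})
keys = ["match_id", "innings"]

# Runs conceded on each delivery, then running totals within each innings
grouped = demo.assign(ball_runs=demo["runs_off_bat"] + demo["extras"]).groupby(keys)
demo["cumulative_runs"] = grouped["ball_runs"].cumsum()
demo["wickets"] = grouped["wicket"].cumsum()

print(demo[keys + ["cumulative_runs", "wickets"]])
```

The counters restart at the first delivery of every innings because `cumsum()` is evaluated per group.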

In [None]:
ball.drop(['noballs' , 'byes' , 'legbyes','wides'], axis=1, inplace=True)

In [None]:
ball['cumulative_runs'] = ball.groupby(['match_id', 'innings'])['runs_off_bat'].cumsum() + ball.groupby(['match_id', 'innings'])['extras'].cumsum()

In [None]:
ball['wickets'] = ball.groupby(['match_id', 'innings'])['wicket'].cumsum() 

In [None]:
ball

In [None]:
ball.drop(['striker' , 'non_striker' , 'bowler','match_id','runs_off_bat','extras','innings','wicket','start_date'], axis=1, inplace=True)

In [None]:
ball

- **Obtaining Unique Stadium Names:**
  - `stadiums = ball['venue'].unique()`
    - Retrieves unique stadium names from the 'venue' column of the DataFrame `ball`.

- **Creating a Stadium Dictionary:**
  - `stadium_dict = { stadium : i + 1 for i, stadium in enumerate(stadiums)}`
    - Generates a dictionary `stadium_dict` where each unique stadium name is assigned a numerical value (incremented by 1) based on its position in the unique stadium names list.

- **Mapping Numerical Values to Stadium Names in 'venue' Column:**
  - `ball['venue'] = ball['venue'].map(stadium_dict)`
    - Maps the numerical values from `stadium_dict` to replace stadium names in the 'venue' column of the DataFrame `ball`.

- **Encoding Categorical Columns with One-Hot Encoding:**
  - `ball = pd.get_dummies(ball, columns=['batting_team', 'bowling_team'], dtype=int)`
    - Utilizes one-hot encoding via `pd.get_dummies()` to convert categorical variables 'batting_team' and 'bowling_team' into binary encoded columns in the DataFrame `ball`.
    - The original categorical columns are replaced with their binary representations for further analysis.
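
The two encodings can be compared on a toy frame. The venue labels "Ground A" and "Ground B" and the `venue_codes` name are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy rows with placeholder venue names
demo = pd.DataFrame({
    "venue":        ["Ground A", "Ground B", "Ground A"],
    "batting_team": ["India", "England", "India"],
    "bowling_team": ["England", "India", "Australia"],
})

# Integer code per venue, numbered from 1 in order of first appearance
venue_codes = {name: i + 1 for i, name in enumerate(demo["venue"].unique())}
demo["venue"] = demo["venue"].map(venue_codes)

# One 0/1 column per team and role
encoded = pd.get_dummies(demo, columns=["batting_team", "bowling_team"], dtype=int)
print(encoded)
```

One point to keep in mind: integer codes give venues an arbitrary numeric order, which linear models may read as meaningful, whereas the one-hot columns carry no such order.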


In [None]:
stadiums=ball['venue'].unique()
stadium_dict = { stadium : i + 1 for i, stadium in enumerate(stadiums)}
stadium_dict

In [None]:
ball['venue'] = ball['venue'].map(stadium_dict)

In [None]:
ball = pd.get_dummies(ball, columns=['batting_team', 'bowling_team'], dtype=int)

In [None]:
ball.columns

In [None]:
col=['batting_team_Afghanistan', 'batting_team_Australia',
       'batting_team_Bangladesh', 'batting_team_England', 'batting_team_India',
       'batting_team_Netherlands', 'batting_team_New Zealand',
       'batting_team_Pakistan', 'batting_team_South Africa',
       'batting_team_Sri Lanka', 'bowling_team_Afghanistan',
       'bowling_team_Australia', 'bowling_team_Bangladesh',
       'bowling_team_England', 'bowling_team_India',
       'bowling_team_Netherlands', 'bowling_team_New Zealand',
       'bowling_team_Pakistan', 'bowling_team_South Africa',
       'bowling_team_Sri Lanka','venue', 'ball',  'cumulative_runs', 'wickets','total']

In [None]:
ball=ball[col]
ball.head()

- **Creating Training Set (`X_train`):**
  - `X_train = ball.drop(labels='total', axis=1)[ball['ball'] < 30]`
    - Builds `X_train` from the rows whose 'ball' value (in over.ball notation) is below 30, i.e. deliveries in the first 30 overs, with the target column 'total' removed.

- **Creating Validation Set (`X_val`):**
  - `X_val = ball.drop(labels='total', axis=1)[(ball['ball'] >= 30) & (ball['ball'] < 40)]`
    - Builds `X_val` the same way from deliveries whose 'ball' value is between 30 (inclusive) and 40 (exclusive).

- **Creating Test Set (`X_test`):**
  - `X_test = ball.drop(labels='total', axis=1)[ball['ball'] >= 40]`
    - Builds `X_test` the same way from deliveries whose 'ball' value is 40 or more, i.e. the final ten overs.
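
A small sketch of this over-based split, using named boolean masks on made-up rows. The `demo` frame and the mask names are assumptions for illustration.

```python
import pandas as pd

# Toy rows spanning the three over ranges (hypothetical values)
demo = pd.DataFrame({
    "ball":            [0.1, 29.6, 30.1, 39.6, 40.1, 49.6],
    "cumulative_runs": [1, 150, 151, 210, 212, 290],
    "total":           [290] * 6,
})

train_mask = demo["ball"] < 30
val_mask = (demo["ball"] >= 30) & (demo["ball"] < 40)
test_mask = demo["ball"] >= 40

# Features drop the target column; targets keep only that column
X_train_demo = demo.loc[train_mask].drop(columns="total")
y_train_demo = demo.loc[train_mask, "total"].to_numpy()

print(train_mask.sum(), val_mask.sum(), test_mask.sum())
```

Defining each mask once and reusing it for both features and targets keeps the two aligned and avoids repeating the conditions.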

In [None]:
X_train = ball.drop(labels='total', axis=1)[ball['ball']< 30]
X_val = ball.drop(labels='total', axis=1)[(ball['ball']>= 30) & (ball['ball'] < 40) ]
X_test = ball.drop(labels='total', axis=1)[ball['ball']>=40 ]

In [None]:
X_train

In [None]:
X_val

In [None]:
X_test

In [None]:
y_train = ball[ball['ball']< 30]['total'].values
y_val = ball[(ball['ball']>= 30) & (ball['ball'] < 40)]['total'].values
y_test = ball[ball['ball']>= 40]['total'].values

In [None]:
regressor = LinearRegression()
regressor.fit(X_train,y_train)

In [None]:
y_pred = regressor.predict(X_test).round(0).astype(int)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

In [None]:
ridge=Ridge()
parameters={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40]}
ridge_regressor=GridSearchCV(ridge,parameters,scoring='neg_mean_squared_error',cv=5)
ridge_regressor.fit(X_train,y_train)

In [None]:
y_pred = ridge_regressor.predict(X_test).round(0).astype(int)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

In [None]:
print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

In [None]:
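# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(y_test - y_pred, kde=True)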

In [None]:
lasso=Lasso()
parameters={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40]}
lasso_regressor=GridSearchCV(lasso,parameters,scoring='neg_mean_squared_error',cv=5)

lasso_regressor.fit(X_train,y_train)
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)

In [None]:
y_pred = lasso_regressor.predict(X_test).round(0).astype(int)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

In [None]:
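# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(y_test - y_pred, kde=True)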

In [None]:
tree = DecisionTreeRegressor(max_depth=17,random_state=42)  
tree.fit(X_train, y_train)

In [None]:
y_pred = tree.predict(X_test).round(0).astype(int)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

In [None]:
regressor = RandomForestRegressor(n_estimators=150, random_state=43)  
regressor.fit(X_train, y_train)

In [None]:
y_pred = regressor.predict(X_test).round(0).astype(int)


mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

In [None]:
src = keras.Sequential([
    keras.layers.Dense(16, input_shape=(X_train.shape[1],), activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(4, activation='relu'),
    keras.layers.Dense(2, activation='relu'),
    keras.layers.Dense(1)  
])

src.compile(optimizer='adam', loss='mean_squared_error')
src.fit(X_train, y_train, epochs=150, batch_size=32,validation_data=(X_val, y_val))

In [None]:
y_pred = src.predict(X_test).round(0).astype(int)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

# ODI Match Winner

In [None]:
total_score = pd.read_csv('/kaggle/input/icc-mens-world-cup-2023/deliveries.csv')

In [None]:
match = pd.read_csv('/kaggle/input/icc-mens-world-cup-2023/matches.csv')

In [None]:
total_score.tail()

In [None]:
match.tail()

In [None]:
result = total_score.groupby(['match_id', 'innings'])[['runs_off_bat', 'extras']].sum()

result = result.reset_index()

# The chasing side needs one run more than the innings total, hence the + 1
result['total'] = result['runs_off_bat'] + result['extras'] + 1
result.drop(['runs_off_bat' , 'extras'], axis=1, inplace=True)
result.head(10)

In [None]:
result = result[result['innings'] == 1]

In [None]:
match_final = match.merge(result[['match_id', 'total']], left_on = 'match_number', right_on = 'match_id')

In [None]:
match_final.head()

In [None]:
final_cricket = match_final[['match_id','winner','total']].merge(total_score, on = 'match_id')

In [None]:
final_cricket

In [None]:
# .copy() so that the columns added below are written to an independent frame, not a slice
final_cricket = final_cricket[final_cricket['innings'] == 2].copy()

In [None]:
final_cricket['cumulative_runs'] = final_cricket.groupby(['match_id', 'innings'])['runs_off_bat'].cumsum() + final_cricket.groupby(['match_id', 'innings'])['extras'].cumsum()

In [None]:
final_cricket.head()

In [None]:
final_cricket['runs_left'] = final_cricket['total'] - final_cricket['cumulative_runs']

In [None]:
final_cricket['ball']

In [None]:
# round() before the modulo guards against floating-point error in x * 10
balls_played = final_cricket['ball'].apply(lambda x: int(x) * 6 + int(round(x * 10)) % 10)

In [None]:
final_cricket['balls_left'] = 300 - balls_played
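
The conversion from over.ball notation to balls bowled can be sketched as a standalone function. The function names and the `innings_balls` parameter are illustrative assumptions; the figure of 300 balls per innings is the one used in the cell above.

```python
def balls_bowled(over_ball: float) -> int:
    """Convert over.ball notation (e.g. 12.4) into the number of balls bowled."""
    completed_overs = int(over_ball)
    # round() protects against values such as 123.99999 after multiplying by 10
    balls_in_current_over = int(round(over_ball * 10)) % 10
    return completed_overs * 6 + balls_in_current_over


def balls_left(over_ball: float, innings_balls: int = 300) -> int:
    """Balls remaining in an innings of `innings_balls` deliveries."""
    return innings_balls - balls_bowled(over_ball)


print(balls_bowled(0.1))   # 1
print(balls_bowled(12.4))  # 76
print(balls_left(49.6))    # 0
```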

In [None]:
final_cricket.head()

In [None]:
final_cricket['wickets_left'] = final_cricket['wicket_type'].apply(lambda x: 1 if not pd.isna(x) else 0)
final_cricket['wickets_left'] = final_cricket.groupby(['match_id', 'innings'])['wickets_left'].cumsum()
final_cricket['wickets_left'] = 10 - final_cricket['wickets_left']

In [None]:
final_cricket['crr'] = final_cricket['cumulative_runs'] * 6 / (300 - final_cricket['balls_left'])

In [None]:
final_cricket['rrr'] = final_cricket['runs_left'] * 6 / final_cricket['balls_left']
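
A worked example of the two rates may help. Both are expressed in runs per six-ball over: `crr` is the current run rate and `rrr` the required run rate. The function name and the sample numbers are assumptions for illustration.

```python
def run_rates(target, runs_scored, balls_bowled, innings_balls=300):
    """Return (current run rate, required run rate) in runs per over.

    Assumes at least one ball has been bowled and at least one remains,
    so neither division is by zero.
    """
    runs_left = target - runs_scored
    balls_left = innings_balls - balls_bowled
    crr = runs_scored * 6 / balls_bowled
    rrr = runs_left * 6 / balls_left
    return crr, rrr


# Chasing 280: 150 scored after 30 overs (180 balls)
crr, rrr = run_rates(target=280, runs_scored=150, balls_bowled=180)
print(crr, rrr)  # 5.0 6.5
```

Here the chasing side has scored at 5.0 an over and needs 130 more from 120 balls, so the required rate is 6.5.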

In [None]:
final_cricket['result'] = final_cricket.apply(lambda row: 1 if row['batting_team'] == row['winner'] else 0, axis=1)

In [None]:
match_winner = final_cricket[['batting_team', 'bowling_team', 'venue', 'runs_left', 'balls_left', 'wickets_left', 'total', 'crr', 'rrr', 'result']]

In [None]:
match_winner = match_winner.sample(match_winner.shape[0])

In [None]:
match_winner = match_winner[match_winner['balls_left'] != 0]

In [None]:
match_winner

In [None]:
X = match_winner.iloc[:,:-1]
y = match_winner.iloc[:,-1]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
trf = ColumnTransformer([
    # On scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`
    ('trf', OneHotEncoder(sparse=False, drop='first'), ['batting_team', 'bowling_team', 'venue'])
], remainder='passthrough')

In [None]:
def final():
    winner = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(33,)), 
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    winner.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
    return winner

In [None]:
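# tf.keras.wrappers.scikit_learn is deprecated and missing from recent TensorFlow releases, where the SciKeras package provides the equivalent wrapper
model = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn = final, epochs = 10, batch_size = 32)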

In [None]:
pipe = Pipeline(steps=[
    ('step1', trf),
    ('step2', model)
])

In [None]:
pipe.fit(X_train, y_train)

In [None]:
y_pred = pipe.predict(X_test)
y_prob = pipe.predict_proba(X_test)

In [None]:
y_prob[11]

In [None]:
X_train.describe()

In [None]:
'''logistic_regression = LogisticRegression(solver='saga', max_iter=200)
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)'''

In [None]:
#logistic_regression.predict_proba(X_test)[10]

In [None]:
#winner.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

In [None]:
'''y_pred = winner.predict(X_test)
y_pred = (y_pred > 0.5).astype(int)'''

In [None]:
#np.column_stack((y_pred, 1-y_pred))

In [None]:
#1-y_pred

In [None]:
def match_summary(row):
    print("Batting Team-" + row['batting_team'] + " | Bowling Team-" + row['bowling_team'] + " | Target- " + str(row['total']))

In [None]:
final_cricket.head(2)

In [None]:
def match_progression(x_df,match_id,pipe):
    # .copy() so the helper column below is not written to a slice of x_df
    match = x_df[x_df['match_id'] == match_id].copy()
    match['fractional_part'] = match['ball'].apply(lambda x: x - int(x))
    tolerance = 1e-10
    match = match[np.isclose(match['fractional_part'], 0.6, atol=tolerance)]
    temp_df = match[['batting_team','bowling_team','venue','runs_left','balls_left','wickets_left','total','crr','rrr']].dropna()
    temp_df = temp_df[temp_df['balls_left'] != 0]
    result = pipe.predict_proba(temp_df)
    temp_df['lose'] = np.round(result.T[0]*100,1)
    temp_df['win'] = np.round(result.T[1]*100,1)
    temp_df['end_of_over'] = range(1,temp_df.shape[0]+1)
    
    target = temp_df['total'].values[0]
    runs = list(temp_df['runs_left'].values)
    new_runs = runs[:]
    runs.insert(0,target)
    temp_df['runs_after_over'] = np.array(runs)[:-1] - np.array(new_runs)
    wickets = list(temp_df['wickets_left'].values)
    new_wickets = wickets[:]
    new_wickets.insert(0,10)
    wickets.append(0)
    w = np.array(wickets)
    nw = np.array(new_wickets)
    temp_df['wickets_in_over'] = (nw - w)[0:temp_df.shape[0]]
    
    print("Target-",target)
    temp_df = temp_df[['end_of_over','runs_after_over','wickets_in_over','lose','win']]
    return temp_df,target

In [None]:
temp_df,target = match_progression(final_cricket,9,pipe)
temp_df

In [None]:
plt.figure(figsize=(18,8))
plt.plot(temp_df['end_of_over'],temp_df['wickets_in_over'],color='yellow',linewidth=3)
plt.plot(temp_df['end_of_over'],temp_df['win'],color='#00a65a',linewidth=4)
plt.plot(temp_df['end_of_over'],temp_df['lose'],color='red',linewidth=4)
plt.bar(temp_df['end_of_over'],temp_df['runs_after_over'])
plt.title('Target-' + str(target))

In [None]:
teams = sorted(match_winner['batting_team'].unique())

In [None]:
venue = sorted(match_winner['venue'].unique())

In [None]:
with open('pipe.pkl', 'wb') as file:
    pickle.dump(pipe, file)

# Team Composition

In [None]:
np.random.seed(0)

In [None]:
url = 'http://howstat.com/cricket/Statistics/WorldCup/SeriesAnalysis.asp?SeriesCode=1117'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

    batting_table = soup.find_all('table', {'class': 'TableLined'})[0] 
    
    data = []

    for row in batting_table.find_all('tr'):
        row_data = []

        for cell in row.find_all(['td', 'th']):
            cell_value = cell.text.strip().replace('\n', '').replace('\t', '').replace('\r', '')
  
            if cell_value == '-':
                cell_value = np.nan
        
            if not cell_value:
                cell_value = np.nan
            row_data.append(cell_value)
        
        data.append(row_data)

    df = pd.DataFrame(data)

    if not df.empty and df.iloc[0].count() > 0:
        df.columns = df.iloc[0]
        df = df[1:]

    display(df)
else:
    print('Failed to retrieve the webpage')

In [None]:
batting = df.copy()

In [None]:
url = 'http://howstat.com/cricket/Statistics/WorldCup/SeriesAnalysis.asp?SeriesCode=1117'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

    bowling_table = soup.find_all('table', {'class': 'TableLined'})[1] 
    
    data = []

    for row in bowling_table.find_all('tr'):
        row_data = []

        for cell in row.find_all(['td', 'th']):
            cell_value = cell.text.strip().replace('\n', '').replace('\t', '').replace('\r', '')
            if not cell_value:
                cell_value = np.nan
            row_data.append(cell_value)

        data.append(row_data)

    df = pd.DataFrame(data)

    if not df.empty and df.iloc[0].count() > 0:
        df.columns = df.iloc[0]
        df = df[1:] 

    display(df)
else:
    print('Failed to retrieve the webpage')

In [None]:
batting[batting['Country'] == 'India']

In [None]:
bowling = df.copy()

In [None]:
batting_players = set(batting['Player'])
bowling_players = set(bowling['Player'])

is_subset = bowling_players.issubset(batting_players)

print(is_subset)

In [None]:
bowling = bowling.drop(['Country'], axis=1)

In [None]:
batting = batting.rename(columns={'S/R': 'Bat_S/R', 'Avg': 'Bat_Avg'})
bowling = bowling.rename(columns={'S/R': 'Bowl_S/R', 'Avg': 'Bowl_Avg', 'Mat': 'Inns'})

In [None]:
player_stats = pd.merge(batting, bowling, on='Player', how='outer', suffixes=('_batting', '_bowling'))

In [None]:
player_stats['Inns'] = player_stats['Inns_bowling'].combine_first(player_stats['Inns_batting'])

In [None]:
player_stats = player_stats.drop(['Inns_batting', 'Inns_bowling'], axis=1)

In [None]:
player_stats[player_stats['Country'] == 'India']

In [None]:
player_stats['Country'].unique()

In [None]:
player_stats[player_stats['Country'] == 'England']

In [None]:
player_stats[player_stats['Country'] == 'Netherlands'].shape

In [None]:
player_stats['HS'] = player_stats['HS'].str.replace(r'\*$', '', regex=True)

In [None]:
type(player_stats['Mat'][1])

In [None]:
player_stats['Best'] = player_stats['Best'].str.split('/').str[0]

In [None]:
player_stats['% Team Runs'] = player_stats['% Team Runs'].str.rstrip('%')

In [None]:
player_stats = player_stats.fillna(0)

In [None]:
player_stats

In [None]:
numeric_cols = ['Mat', 'Inns', 'NO', '50s', '100s', '0s', 'HS', 'Runs', 'Bat_S/R', 'Bat_Avg', 'Ca', 'St',
                '% Team Runs', 'O', 'M', 'R', 'W', '4w', 'Best', 'Bowl_Avg', 'Bowl_S/R', 'E/R']
player_stats[numeric_cols] = player_stats[numeric_cols].apply(pd.to_numeric)

In [None]:
player_stats = player_stats.reset_index(drop=True)

In [None]:
player_stats

In [None]:
features = ['NO', '50s', '100s', '0s', 'HS', 'Runs', 'Bat_S/R', 'Bat_Avg', 'Ca', 'St', '% Team Runs', 'O', 'M', 'R', 'W', '4w', 'Best', 'Bowl_Avg', 'Bowl_S/R', 'E/R']

In [None]:
player_stats['PlayingXI'] = player_stats['Inns'].apply(lambda x: 0 if (9 - x)/9 > 0.4 else 1)
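
The labelling rule above can be unpacked with a short sketch. The function and parameter names are hypothetical. The constant 9 is taken from the cell above and presumably reflects the nine group-stage matches each team played.

```python
def playing_xi_label(innings, group_matches=9, max_missed_share=0.4):
    """Label 1 (regular) unless the player missed more than 40% of the matches."""
    missed_share = (group_matches - innings) / group_matches
    return 0 if missed_share > max_missed_share else 1


labels = {inns: playing_xi_label(inns) for inns in range(10)}
print(labels)
```

Since `(9 - x) / 9 > 0.4` is equivalent to `x < 5.4`, a player is labelled 1 only with six or more innings.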

In [None]:
player_stats[player_stats['Country'] == 'India']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(player_stats[features], player_stats['PlayingXI'], test_size=0.2, random_state=0)

In [None]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
play_xi = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(len(features),)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(8, activation='relu'),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

play_xi.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
play_xi.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.2)

In [None]:
accuracy = play_xi.evaluate(X_test_scaled, y_test)[1]
accuracy

In [None]:
train_probabilities = play_xi.predict(X_train_scaled)
test_probabilities = play_xi.predict(X_test_scaled)

In [None]:
'''player_stats.loc[X_train.index, 'PredictedPlayingXI'] = (train_probabilities > 0.5).astype(int)
player_stats.loc[X_test.index, 'PredictedPlayingXI'] = (test_probabilities > 0.5).astype(int)'''

In [None]:
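# predict() returns arrays of shape (n, 1); flatten them so each row receives a scalar
player_stats.loc[X_train.index, 'PredictedProbabilities'] = train_probabilities.ravel()
player_stats.loc[X_test.index, 'PredictedProbabilities'] = test_probabilities.ravel()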

In [None]:
for country in player_stats['Country'].unique():
    top_players = player_stats[player_stats['Country'] == country].nlargest(11, 'PredictedProbabilities')['Player']
    print(f"\nTop 11 players with the highest probability for {country}:\n{top_players}")

In [None]:
player_stats['PlayingXI'].value_counts()

In [None]:
player_stats

In [None]:
team_india_data = player_stats[player_stats['Country'] == 'India']

player_probabilities = team_india_data.groupby('Player')['PredictedProbabilities'].mean()

sorted_players = player_probabilities.sort_values(ascending=False)

print("Player Selection Probabilities for Team India:")
for player, probability in sorted_players.items():
    print(f"{player}: {probability:.4f}")

# Total Sixes in this Tournament

In [None]:
sixes = pd.read_csv("/kaggle/input/icc-mens-world-cup-2023/deliveries.csv")

In [None]:
sixes

In [None]:
directory = "/kaggle/input/recent-30-days/"

file_list = os.listdir(directory)

final_df = pd.DataFrame()

for file_name in file_list:
    if file_name.endswith(".csv"):
        file_path = os.path.join(directory, file_name)

        try:
            df = pd.read_csv(file_path, on_bad_lines='skip',low_memory = False)
            final_df = pd.concat([final_df, df], ignore_index=True)
            
        except pd.errors.ParserError as e:
            print(f"Error reading {file_name}: {e}")

In [None]:
final_df.drop(columns=['version', '2.2.0'], inplace=True)

In [None]:
final_df.dropna(how='all', inplace=True)

In [None]:
countries = ['England', 'New Zealand', 'Pakistan', 'Netherlands', 'Afghanistan', 'Bangladesh', 'South Africa', 'Sri Lanka', 'Australia', 'India']

In [None]:
final_df = final_df[final_df['batting_team'].isin(countries)]
final_df

In [None]:
final_df['start_date'] = pd.to_datetime(final_df['start_date'])

In [None]:
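# Keep deliveries from 2 November onward. Comparing full dates avoids dropping later
# dates whose day of the month is 1. The year 2023 is assumed from the tournament.
world_cup = final_df[final_df['start_date'] >= pd.Timestamp('2023-11-02')]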

In [None]:
world_cup = world_cup.sort_values(by='start_date')

In [None]:
sixes = pd.concat([sixes, world_cup], ignore_index=True)

In [None]:
sixes

In [None]:
sixes.head(3)

In [None]:
sixes_final = sixes.copy()

In [None]:
sixes_final

In [None]:
sixes_final['six'] = (sixes_final['runs_off_bat'] == 6).cumsum()

In [None]:
sixes_final

In [None]:
data = sixes_final['six'].values.reshape(-1, 1)

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data)

In [None]:
def create_dataset(dataset, look_back=1):
    X, y = [], []
    for i in range(len(dataset)-look_back):
        X.append(dataset[i:(i+look_back), 0])
        y.append(dataset[i + look_back, 0])
    return np.array(X), np.array(y)
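
To see what the windowing produces, here is the same function applied to a tiny made-up series. The `series`, `X_demo`, and `y_demo` names are assumptions for the example.

```python
import numpy as np


def create_dataset(dataset, look_back=1):
    """Slice a column vector into (window, next value) pairs."""
    X, y = [], []
    for i in range(len(dataset) - look_back):
        X.append(dataset[i:i + look_back, 0])   # look_back consecutive values
        y.append(dataset[i + look_back, 0])     # the value that follows them
    return np.array(X), np.array(y)


series = np.arange(6, dtype=float).reshape(-1, 1)  # 0, 1, ..., 5 as a column
X_demo, y_demo = create_dataset(series, look_back=3)

print(X_demo)  # rows: [0 1 2], [1 2 3], [2 3 4]
print(y_demo)  # [3. 4. 5.]
```

A series of length n yields n - look_back samples, so the first `look_back` observations never appear as targets.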

In [None]:
look_back = 30
X, y = create_dataset(scaled_data, look_back)

In [None]:
y.shape

In [None]:
train_size = int(len(X) * 0.75)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

In [None]:
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

In [None]:
model = Sequential()
model.add(LSTM(units=50, return_sequences = True, input_shape=(X_train.shape[1], 1)))
model.add(LSTM(units=50, return_sequences = True))
model.add(LSTM(units=25))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model.fit(X_train, y_train, epochs = 50, batch_size = 32, validation_data = (X_test, y_test))

In [None]:
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

In [None]:
plt.plot(np.concatenate([train_predictions, test_predictions]), label='Predicted')
plt.legend()
plt.show()

In [None]:
train_predictions = scaler.inverse_transform(train_predictions)
test_predictions = scaler.inverse_transform(test_predictions)

In [None]:
train_predictions

In [None]:
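# Bring the targets back to the original scale so they match the inverse-transformed predictions
y_train_actual = scaler.inverse_transform(y_train.reshape(-1, 1))
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))
train_rmse = np.sqrt(mean_squared_error(y_train_actual, train_predictions))
test_rmse = np.sqrt(mean_squared_error(y_test_actual, test_predictions))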
print(f'Train RMSE: {train_rmse}')
print(f'Test RMSE: {test_rmse}')

In [None]:
plt.plot(scaler.inverse_transform(scaled_data), label='Actual')
plt.plot(np.concatenate([train_predictions, test_predictions]), label='Predicted')
plt.legend()
plt.show()

In [None]:
full_X = np.concatenate((X_train, X_test))
full_y = np.concatenate((y_train, y_test))

model.fit(full_X, full_y, epochs=50, batch_size=32)

In [None]:
forecast_steps = 3600

current_sequence = X_test[-1]

predicted_values = []

In [None]:
X_test[-1]

In [None]:
look_back = 30

In [None]:
for i in range(forecast_steps):
    current_sequence_reshaped = np.reshape(current_sequence, (1, look_back, 1))
    
    next_value = model.predict(current_sequence_reshaped, verbose=0)
    
    predicted_values.append(next_value[0, 0])
    
    current_sequence = np.roll(current_sequence, -1)
    current_sequence[-1] = next_value[0, 0]
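
The loop above is a recursive (multi-step) forecast: each prediction is fed back in as the newest input. A minimal sketch of that pattern follows, with a toy predictor standing in for the LSTM. The names `recursive_forecast` and `toy_predictor` are hypothetical.

```python
import numpy as np


def recursive_forecast(predict_next, seed_window, steps):
    """Roll a window forward, appending each prediction as the newest value."""
    window = np.array(seed_window, dtype=float).reshape(-1)  # private copy
    forecasts = []
    for _ in range(steps):
        next_value = float(predict_next(window))
        forecasts.append(next_value)
        window = np.roll(window, -1)   # drop the oldest value
        window[-1] = next_value        # append the prediction
    return np.array(forecasts)


def toy_predictor(window):
    # Stand-in for model.predict: continue the most recent increment
    return window[-1] + (window[-1] - window[-2])


print(recursive_forecast(toy_predictor, [1.0, 2.0, 3.0], steps=4))  # [4. 5. 6. 7.]
```

Because every step consumes earlier predictions, errors can compound over a long horizon such as the 3600 steps used here.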

In [None]:
predicted_values = np.array(predicted_values).reshape(-1, 1)
predicted_values = scaler.inverse_transform(predicted_values)

In [None]:
predicted_values

In [None]:
print(f'Total Sixes in this ICC World Cup 2023 : {int(np.round(predicted_values[-1, 0]))}')

In [None]:
with open('six.pkl', 'wb') as file:
    pickle.dump(model, file)

In [None]:
with open('six_df.pkl', 'wb') as file:
    pickle.dump(full_X, file)