## Pump It Up: Data Mining the Water Table

### 1. Defining the Problem
Tanzania faces a major water crisis. Millions lack access to clean drinking water due to limited freshwater sources and malfunctioning water points. This lack of safe water has severe consequences, including health risks, decreased quality of life, and even death for children. The Tanzanian government is working to improve sanitation, but better water resource management is crucial for the country's future.


### 1.1 Tanzania faces a critical challenge:
Millions of people lack access to safe drinking water due to malfunctioning water points. This project aims to leverage machine learning to predict the functionality of water points, helping prioritize maintenance and ensure clean water reaches communities across the country. By analyzing data on factors like pump type, installation age, location, and management practices, we can build a model to classify water points into three categories: functional, needing repair, or non-functional. This information can empower authorities to:


### 1.2 Target Maintenance Efforts:
Prioritize repairs for water points most at risk of failure, ensuring efficient resource allocation and minimizing downtime.

Preventative Maintenance: Identify pumps nearing the end of their lifespan or susceptible to breakdowns based on historical data, prompting proactive maintenance to avoid service disruptions.


### 1.3 Improve Resource Management:
Gain insights into factors affecting water point functionality, informing strategies for pump selection, installation practices, and long-term management approaches.
By harnessing the power of machine learning, we can move beyond reactive repairs and towards a proactive approach to ensuring clean water security for Tanzania's population. This project tackles the critical business problem of water scarcity by predicting water point functionality, ultimately contributing to improved public health and well-being.


### 1.4 Our project will be successful if we can accurately predict whether a water point is:
a. Fully Functional: The water point is operational and delivers clean water without any current repairs needed.

b. Partially Functional (Needs Repair): The water point is currently operational, but there are potential issues requiring repairs to ensure continued functionality.

c. Non-Functional: The water point is completely out of service and requires repairs to provide clean water again.

### 2.0 Importing libraries

In [3]:
# Importing necessary packages for data analysis, visualization, and machine learning

import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns  # For advanced data visualization
import pandas as pd    # For data manipulation and analysis
import numpy as np     # For numerical computations
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.svm import LinearSVC  # For linear support vector machines
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV  # For hyperparameter tuning
from sklearn.pipeline import Pipeline   # For creating a pipeline of data processing and models
from sklearn.preprocessing import StandardScaler  # For data standardization
from sklearn.ensemble import RandomForestClassifier  # For random forest classification
from sklearn.ensemble import GradientBoostingClassifier  # For gradient boosting classification
from datetime import datetime  
import warnings

warnings.filterwarnings("ignore")  # Suppress warnings 

ImportError: DLL load failed while importing _qhull: The specified module could not be found.

### 2.1 Loading the data

In [None]:
#load data set
train = pd.read_csv('test doc.csv')
test = pd.read_csv('test doc 2.csv')
data = pd.read_csv('test doc 3.csv')

### 2.2 Explore the data

In [None]:
test.shape

In [None]:
test.head()

In [None]:
train.shape

In [None]:
train.head()

In [None]:
data.shape

In [None]:
data.head()

### 2.3 Merging the data

In [None]:
train_data = train.merge(data,on='id',how='inner')

In [None]:
df = pd.concat([train_data, test])
df.head()

In [None]:
df.shape
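
A quick sanity check on the merge and concatenation can be useful at this point. The sketch below uses small made-up frames (the ids and values are illustrative, not taken from the dataset) to show how `validate` and `indicator` in `DataFrame.merge` surface duplicated or unmatched ids, and how rows appended from an unlabeled frame end up with a missing `status_group`.

```python
import pandas as pd

# Toy stand-ins for the feature, label, and unlabeled tables (values are made up)
toy_features = pd.DataFrame({"id": [1, 2, 3], "region": ["Iringa", "Mara", "Pwani"]})
toy_labels = pd.DataFrame({"id": [1, 2, 3],
                           "status_group": ["functional", "non functional", "functional"]})
toy_unlabeled = pd.DataFrame({"id": [4, 5], "region": ["Mara", "Iringa"]})

# validate="one_to_one" raises MergeError if an id is duplicated on either side;
# indicator=True adds a _merge column recording where each row came from
merged = toy_features.merge(toy_labels, on="id", how="left",
                            validate="one_to_one", indicator=True)
unmatched = int((merged["_merge"] != "both").sum())

# Appending unlabeled rows leaves their status_group as NaN
combined = pd.concat([merged.drop(columns="_merge"), toy_unlabeled], ignore_index=True)
missing_labels = int(combined["status_group"].isna().sum())
print(unmatched, missing_labels)
```

Keeping track of which rows carry no label makes it easier to separate them again before modeling.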

### 2.4 Understanding the columns of the merged dataset

In [None]:
print('Number of data points : ', df.shape[0])
print('Number of features : ', df.shape[1])
print('Features : ', df.columns.values)
df.head(5)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
pd.options.display.max_columns=100 #for reading all columns

In [None]:
# checking the column names
df.columns

In [None]:
def check_value_counts(data):
  for column in data.columns:
    print(f'value counts for {column}')
    print(data[column].value_counts())
    print('------------------------------------------','\n')

check_value_counts(df)

The value counts reveal no obvious inconsistencies in the data.

In [None]:
df.isnull().sum()

This table shows the distribution of missing values across different features. Some features have a significant number of null values.

## 3.0 Data Preparation

### 3.1.Data Cleaning

In [None]:
# Preview a sample of records to see whether all records are appropriately ordered
df.sample(10)

In [None]:
# Count the number of duplicate rows
df.duplicated().sum()

In [None]:
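# Fill missing construction years with 1993 first, so the derived features use the filled values
recorded_year = pd.to_datetime(df['date_recorded']).dt.year
construction_year = df['construction_year'].fillna(1993)
age = recorded_year - construction_year

df = df.assign(
    construction_year=construction_year,
    age=age,
    # Population per year of operation; zero population and non-positive ages are floored at 1
    pop_year=df['population'].replace({0: 1}) / age.clip(lower=1))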
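
One caveat with the age feature: `fillna` only replaces true missing values. If the dataset encodes an unknown construction year as 0 (an assumption worth checking with `value_counts()`), those rows keep the 0 and produce ages of roughly 2000 years. A minimal sketch with made-up rows, converting 0 to NaN before filling; the median is used here instead of the notebook's fixed 1993 purely for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-02-04", "2012-10-01"],
    "construction_year": [1999, 0, 2010],  # 0 is assumed to mean "unknown"
})

recorded_year = pd.to_datetime(toy["date_recorded"]).dt.year
# Turn the assumed 0 sentinel into NaN, then fill with the median of the known years
known_year = toy["construction_year"].replace(0, np.nan)
filled_year = known_year.fillna(known_year.median())
# Clip at 0 so a record dated before construction does not give a negative age
toy["age"] = (recorded_year - filled_year).clip(lower=0)
print(toy)
```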


In [None]:
# Check for missing values
# Count the missing values in each column, sorted from most to fewest

Total = df.isnull().sum().sort_values(ascending=False)

# Calculating percentages
percent_1 = df.isnull().sum()/df.isnull().count()*100

# rounding off to one decimal point
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)

# creating a dataframe to show the values
missing_data = pd.concat([Total, percent_2], axis=1, keys=['Total', '%'])
missing_data

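The data has a significant number of missing values in some key columns, particularly scheme_name and status_group, which could affect the analysis. The missing status_group values most likely belong to the unlabeled test rows appended by the concat above.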

In [None]:
# Preview only: the result is not assigned, so df itself is unchanged
df.fillna("Not Available")

In [None]:
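def replace_missing(value):
  # Return a placeholder label for a missing value, otherwise the value itself
  if pd.isna(value):
    return "Missing Value"
  else:
    return value

# Note: fillna() does not call replace_missing on each cell. It stores the function
# object itself in every missing cell, which is the placeholder seen in status_group below.
df = df.fillna(replace_missing)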
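
`DataFrame.fillna` expects a scalar, dict, Series, or DataFrame as the fill value rather than a function to call. A minimal sketch of per-column filling on made-up data (the column names mirror the dataset, but the values and the choice of fill strategy are illustrative assumptions):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "funder": ["Danida", np.nan, "Hesawa"],
    "permit": [True, np.nan, False],
    "population": [120.0, np.nan, 35.0],
})

# Per-column fill values: a text label for categorical fields, the median for the numeric one
fill_values = {
    "funder": "Missing Value",
    "permit": "Missing Value",
    "population": toy["population"].median(),
}
filled = toy.fillna(value=fill_values)
print(filled)
```

Passing a dict keeps numeric columns numeric, which matters later when features are scaled or fed to a model.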

In [None]:
# Drop rows with any missing values (inplace)
df.dropna(inplace=True)

In [None]:
# Impute missing values in 'scheme_name' with "No Record"
df.scheme_name = df.scheme_name.fillna('No Record')

# Impute missing values in 'scheme_management' with "No Record"
df.scheme_management = df.scheme_management.fillna('No Record')

# Impute missing values in 'installer' with "No Record"
df.installer = df.installer.fillna('No Record')

# Impute missing values in 'funder' with "No Record"
df.funder = df.funder.fillna('No Record')

# Impute missing values in 'public_meeting' with "No Record" (might need adjustment)
df.public_meeting = df.public_meeting.fillna('No Record')  # Consider 'Unknown' if data could be missing but meetings happened

# Impute missing values in 'permit' with "No Record"
df.permit = df.permit.fillna('No Record')

# Impute missing values in 'subvillage' with "No Record"
df.subvillage = df.subvillage.fillna('No Record')


In [None]:
# Checking for missing values
print(df.isnull().sum())

In [None]:
# Check for duplicates
df.duplicated().sum()

The data now contains no missing values and is ready for analysis.

In [None]:
# let's get a brief description of the data
df.describe()

In [None]:
#basic descriptive statistics
df['status_group'].describe()

In [None]:
# This output shows the distribution of water point statuses in the data.
# There are 4 unique statuses, with 'functional' being the most frequent (occurring 32259 times, or approximately 43.4% of the data).
# Further analysis can be done to understand the prevalence of other statuses and their implications for water point functionality.

In [None]:
#unique values and their counts

n_unique_values = df['status_group'].nunique()  # Get the number of unique values
value_counts = df['status_group'].value_counts()  # Get the count for each unique value

print(f"Number of unique values: {n_unique_values}")
print(value_counts)

In [None]:
# This output shows the distribution of water point statuses in the data.
# There are 4 unique values in status_group, of which 3 are real categories:
#   * 'functional': Represents water points that are operational (32259 counts).
#   * 'non functional': Represents water points that are not functioning (22824 counts).
#   * '<function replace_missing at 0x000001561A526790>': Not a real status. This is the replace_missing
#     function object that the earlier fillna call wrote into rows with no label (14850 counts).
#     These rows should be given an explicit category (e.g., 'missing') or excluded before modeling.
#   * 'functional needs repair': Represents water points requiring maintenance (4317 counts).

In [None]:
#analysis using groupby functionality
unique_value_counts = df.groupby('status_group').size()  # Group by status and count occurrences
print(unique_value_counts)

## 4.1 Exploratory Data Analysis (EDA)

In [None]:
# Selecting object datatypes columns

# Create a list to store categorical column names
categorical = [
    # Columns containing text or non-numerical data
    'basin', 'region',
    'public_meeting', 'recorded_by',
    'scheme_management', 'permit',
    'extraction_type_group', 'extraction_type_class',
    'management', 'management_group', 'payment_type',
    'quality_group', 'quantity_group',
    'source', 'source_type', 'source_class',
    'waterpoint_type_group'
]

### 4.2 Data Distribution

In [None]:
public_meeting_counts = df['public_meeting'].value_counts()
plt.figure(figsize=(8, 5))
public_meeting_counts.plot(kind='bar', color=['skyblue', 'lightgreen'])
plt.xlabel('Public Meeting Held')
plt.ylabel('Number of Water Points')
plt.title('Frequency of Public Meetings Held')
plt.xticks(rotation=0)  # Ensure all labels are visible
plt.show()


In [None]:
import seaborn as sns  # Import seaborn for plotting

recorded_by_counts = df['recorded_by'].value_counts()  # Get value counts

# 'recorded_by' is categorical, so plot one bar per category
# (a boxplot of a single count per category would be degenerate)
plt.figure(figsize=(8, 5))
sns.barplot(x=recorded_by_counts.index, y=recorded_by_counts.values)  # Use counts as y-axis values
plt.xlabel('Recorded By')
plt.ylabel('Number of Water Points')
plt.title('Distribution of Recorded By')
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.show()


In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='scheme_management', data=df)
plt.xlabel('Scheme Management')
plt.ylabel('Number of Water Points')
plt.title('Water Point Management Distribution')
plt.xticks(rotation=45)  # Rotate x-axis labels if needed
plt.show()


In [None]:
def draw_histogram(data, categorical_column):
  """
  This function creates a histogram for a categorical variable.

  Args:
    data: The pandas DataFrame containing the data.
    categorical_column: The name of the categorical column to visualize.
  """
  plt.figure(figsize=(8, 5))  # Adjust figure size as needed
  plt.hist(data[categorical_column])
  plt.xlabel(categorical_column)
  plt.ylabel('Number of Water Points')
  plt.title(f"Histogram of {categorical_column}", fontsize=12)
  plt.xticks(rotation=45)  # Rotate x-axis labels for readability (optional)
  plt.show()

# Example usage:
draw_histogram(df, 'basin')


This histogram shows the number of water points in each basin.

In [None]:
def draw_histogram(data, categorical_column):
  """
  This function creates a histogram for a categorical feature.

  Args:
      data: The pandas DataFrame containing the data.
      categorical_column: The name of the categorical column to visualize.
  """
  sns.histplot(
      x = categorical_column,
      data=data,
      multiple='dodge'  # Dodge bars to avoid overlapping for multiple categories
  )
  plt.title(f"Histogram of {categorical_column}", fontsize=12)
  plt.xticks(rotation=45)  # Rotate x-axis labels for readability (optional)
  plt.show()
  plt.clf()  # Clear the plot

# Example usage:
draw_histogram(df, 'region')

In [None]:
def draw_bar_chart(data, categorical_column):
  """
  This function creates a bar chart for a categorical feature with counts.

  Args:
      data: The pandas DataFrame containing the data.
      categorical_column: The name of the categorical column to visualize.
  """
  data[categorical_column].value_counts().plot(kind='bar')
  plt.title(f"Count of Public Meetings ({categorical_column})", fontsize=12)
  plt.xlabel(categorical_column)
  plt.ylabel('Count')
  plt.xticks(rotation=0)  # Keep x-axis labels horizontal
  plt.show()
  plt.clf()  # Clear the plot

# Example usage:
draw_bar_chart(df.copy(), 'public_meeting')

This bar chart shows the number of water points with and without public meetings. It helps understand the prevalence of public meetings for water point projects.

In [None]:
def draw_bar_chart(data, categorical_column):
  """
  This function creates a bar chart for a categorical feature with counts.

  Args:
      data: The pandas DataFrame containing the data.
      categorical_column: The name of the categorical column to visualize.
  """
  data[categorical_column].value_counts().plot(kind='bar')
  plt.title(f"Count of Permits ({categorical_column})", fontsize=12)
  plt.xlabel(categorical_column)
  plt.ylabel('Count')
  plt.xticks(rotation=0)  # Keep x-axis labels horizontal
  plt.show()
  plt.clf()  # Clear the plot

# Example usage:
draw_bar_chart(df.copy(), 'permit')

This bar chart shows the number of water points with and without permits.

In [None]:
def draw_boxplot(data, categorical_column, target_column):
  """
  This function creates a boxplot for a categorical feature vs a target variable.

  Args:
      data: The pandas DataFrame containing the data.
      categorical_column: The name of the categorical column to visualize.
      target_column: The name of the target numerical column (e.g., construction_year).
  """
  sns.boxplot(
      x = categorical_column,
      y = target_column,
      showmeans=True,
      data=data
  )
  plt.title(f"Boxplot of {categorical_column} vs {target_column}", fontsize=12)
  plt.xticks(rotation=45)  # Rotate x-axis labels for readability (optional)
  plt.show()
  plt.clf()  # Clear the plot

# Example usage:
draw_boxplot(df.copy(), 'extraction_type_group', 'construction_year')

This boxplot shows the distribution of construction year across different extraction type groups. It helps identify if construction years vary significantly between extraction types.

In [None]:
def draw_histogram(data, categorical_column):
  """
  This function creates a histogram for a categorical feature.

  Args:
      data: The pandas DataFrame containing the data.
      categorical_column: The name of the categorical column to visualize.
  """
  sns.histplot(
      x = categorical_column,
      data=data,
      multiple='dodge'  # Dodge bars to avoid overlapping for multiple categories
  )
  plt.title(f"Histogram of {categorical_column}", fontsize=12)
  plt.xticks(rotation=45)  # Rotate x-axis labels for readability (optional)
  plt.show()
  plt.clf()  # Clear the plot

# Example usage:
draw_histogram(df.copy(), 'extraction_type_class')

This histogram shows the frequency of each extraction type class in your data. It helps visualize the distribution of water points across different extraction type details.

In [None]:
def draw_line_chart(data, categorical_column, target_column):
  """
  This function creates a line chart for a categorical feature vs a numerical variable.

  Args:
      data: The pandas DataFrame containing the data.
      categorical_column: The name of the categorical column to visualize on x-axis.
      target_column: The name of the target numerical column (e.g., construction_year).
  """
  data.groupby(categorical_column)[target_column].mean().plot(kind='line')
  plt.title(f"Average {target_column} by {categorical_column}", fontsize=12)
  plt.xlabel(categorical_column)
  plt.ylabel(target_column)
  plt.xticks(rotation=45)  # Rotate x-axis labels for readability (optional)
  plt.show()
  plt.clf()  # Clear the plot

# Example usage:
draw_line_chart(df.copy(), 'management', 'construction_year')

This line chart shows the average construction year for each management type.

In [None]:
df.hist(figsize=(20,20));# distribution of numerical predictors

These histograms show the distribution of each numerical predictor on its own; they do not show how the predictors relate to the target variable. Further analysis is needed to understand those relationships and to determine how suitable each predictor is for building a robust prediction model.

### Explore outliers

In [None]:
def check_outliers(data, columns):
    fig, axes = plt.subplots(nrows=len(columns), ncols=1, figsize=(20,10))
    for i, column in enumerate(columns):
        # Use interquartile range (IQR) to find outliers for the specified column
        q1 = data[column].quantile(0.25)
        q3 = data[column].quantile(0.75)
        iqr = q3 - q1
        print("IQR for {} column: {}".format(column, iqr))

        # Determine the outliers based on the IQR
        outliers = (data[column] < q1 - 1.5 * iqr) | (data[column] > q3 + 1.5 * iqr)
        print("Number of outliers in {} column: {}".format(column, outliers.sum()))

        # Create a box plot to visualize the distribution of the specified column
        sns.boxplot(data=data, x=column, ax=axes[i])
    plt.show()


num=df.select_dtypes('number')
columns=num.columns
check_outliers(df, columns)

The data has outliers but we won't remove them because that information could be useful for prediction

In [None]:
column_names = df.columns
print(column_names)

In [None]:
# Separate the target: keep only id and status_group as the label table

df_status=df[['id', 'status_group']]
df_status.head()

In [None]:

df_features=df[['id','amount_tsh', 'date_recorded', 'funder',
       'gps_height', 'installer', 'longitude', 'latitude', 'wpt_name',
       'num_private', 'basin', 'subvillage', 'region', 'region_code',
       'district_code', 'lga', 'ward', 'population', 'public_meeting',
       'recorded_by', 'scheme_management', 'scheme_name', 'permit',
       'construction_year', 'extraction_type', 'extraction_type_group',
       'extraction_type_class', 'management', 'management_group', 'payment',
       'payment_type', 'water_quality', 'quality_group', 'quantity',
       'quantity_group', 'source', 'source_type', 'source_class',
       'waterpoint_type', 'waterpoint_type_group', 'age']]

In [None]:
df_features.head()

In [None]:
print(df_status.dtypes)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Check data types
print(df_status.dtypes)

# Drop rows whose status_group is not a string (for example a function-object placeholder)
non_strings = df_status['status_group'].apply(lambda x: not isinstance(x, str))
df_status = df_status[~non_strings].copy()  # copy() avoids a SettingWithCopyWarning below

# Now you can use LabelEncoder
label_encoder = LabelEncoder()
df_status['status_group'] = label_encoder.fit_transform(df_status['status_group'])


In [None]:
from sklearn import preprocessing

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'status_group'.
df_status['status_group'] = label_encoder.fit_transform(df_status['status_group'])

df_status['status_group'].unique()


In [None]:
df_status.head()
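
Since later cells refer to the statuses by integer code, it helps to know which code stands for which label. The sketch below, on a short made-up list of statuses, shows that `LabelEncoder` assigns codes in sorted order of the class names and that the mapping can be read from `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

statuses = ["functional", "non functional", "functional needs repair", "functional"]

encoder = LabelEncoder()
codes = encoder.fit_transform(statuses)

# classes_ is sorted, so the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
decoded = list(encoder.inverse_transform(codes))
```

With these three labels the codes come out as 0 for `functional`, 1 for `functional needs repair`, and 2 for `non functional`.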

### Performing Feature Engineering

In [None]:
df=df_features
df.columns

In [None]:
# for df
df['water_/_person'] = df['amount_tsh'].replace({0:1}) / df['population'].replace({0:1})

We will then write a function to check for the cardinality of each feature(how many unique values there are in the feature)

In [None]:
def reverse_cardinality_check(n, df):
  """
  Search the dataframe for features whose cardinality exceeds n and
  return a dict mapping each such feature name to its number of unique values.
  """
  feature_dict = {}

  for column in df.columns:
    cardinality = df[column].nunique()
    if cardinality > n:
      feature_dict[column] = cardinality

  return feature_dict

We will then preview our high cardinality features

In [None]:
high_cardinality_feature_dict = reverse_cardinality_check(150, df)
high_cardinality_feature_dict

We will create dataframes for our high and low cardinality features

In [None]:
# dataframe for high cardinality
high_cardinality_features = df[list(high_cardinality_feature_dict.keys())]
high_cardinality_features.columns

In [None]:
# dataframe for low cardinality features
low_cardinality_features = df.drop(columns = list(high_cardinality_feature_dict.keys()))
low_cardinality_features.columns

Let us now perform one hot encoding for each dataframe

In [None]:
def clean_data(data):
  """
  This function cleans data by handling missing values, dropping rows that
  contain function objects, and converting booleans to strings.

  Args:
      data: A pandas DataFrame.

  Returns:
      A cleaned pandas DataFrame.
  """
  # Fill missing values with a placeholder (e.g., -1 or a string)
  data = data.fillna(-1)  # Replace with your preferred method

  # Remove rows that contain function objects (left behind by an earlier fillna call)
  has_callable = data.applymap(callable).any(axis=1)
  cleaned = data[~has_callable]

  # Convert booleans to strings (optional)
  cleaned = cleaned.applymap(lambda x: str(x) if isinstance(x, bool) else x)
  return cleaned
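
The one-hot encoding step announced above is not shown as code in this notebook. One possible sketch uses `pd.get_dummies` on a made-up frame (the rows are illustrative). It is usually applied to the low-cardinality features only, since each distinct value becomes its own column.

```python
import pandas as pd

toy = pd.DataFrame({
    "basin": ["Rufiji", "Pangani", "Rufiji"],
    "quantity_group": ["enough", "dry", "enough"],
    "gps_height": [1390, 686, 263],
})

# Encode only the listed categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["basin", "quantity_group"], dtype=int)
print(encoded.columns.tolist())
```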


In [None]:
high_cardinality_features.isnull().sum()

In [None]:
# features = low_cardinality_features.concat(high_cardinality_features,on = low_cardinality_features.index)
frames =[low_cardinality_features, high_cardinality_features]

features = pd.concat(frames, axis = 1)

In [None]:
# previewing the datatset
features.head()

Next we impute and scale our features

In [None]:
#Merging df_status and features
df_1 = df_status.merge(features, left_on='id', right_on='id')
df_1.head()
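
The imputing and scaling mentioned above is not shown as code in this notebook. One possible sketch, assuming scikit-learn's `SimpleImputer` and `StandardScaler` chained in a `Pipeline` and applied to a made-up numeric array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric features with one missing value per column
X_toy = np.array([[1.0, 200.0],
                  [np.nan, 400.0],
                  [3.0, np.nan],
                  [5.0, 600.0]])

numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scale", StandardScaler()),                   # zero mean, unit variance per column
])

# Fit on training data only, then reuse the fitted pipeline on new data with .transform
X_prepared = numeric_prep.fit_transform(X_toy)
print(X_prepared.round(2))
```

Wrapping both steps in one pipeline keeps the statistics learned from the training split from leaking into the test split.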

In [None]:
# train and test are different shapes. Find which columns are different.
df.columns

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# df_1 holds the encoded status_group merged with the features

# Define y variable (dependent variable): True where the water point is functional.
# LabelEncoder assigns codes alphabetically, so 'functional' is expected to be 0.
y = df_1['status_group'] == 0  # Binary indicator

# Define x variable (independent variable): explore functionality by region.
# Double brackets keep x as a one-column DataFrame, which scikit-learn estimators expect.
x = df_1[['region']]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# ... (Train your machine learning model using X_train and y_train)

# ... (Evaluate your model performance on X_test and y_test)


In [None]:
print('train: ', X_train.shape, y_train.shape)
print('test: ', X_test.shape, y_test.shape)
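
Because the status classes are imbalanced ('functional needs repair' has far fewer rows than the other two), a stratified split may be worth considering. The sketch below uses made-up labels to show how the `stratify` argument of `train_test_split` keeps each class's share the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 of class 0, 15 of class 2, 5 of class 1
y_toy = np.array([0] * 80 + [2] * 15 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Each class keeps its share of the data in the test split
print(np.bincount(y_te))
```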

In [None]:
import pandas as pd

# Assuming X_train is a Series
X_train = pd.DataFrame(X_train)  # Convert Series to DataFrame


In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assuming you've loaded your data into X_train

# Check if X_train is a DataFrame (optional)
if not isinstance(X_train, pd.DataFrame):
    X_train = pd.DataFrame(X_train)  # Convert Series to DataFrame (if needed)

# Check for column presence
if 'Region' in X_train.columns:
    # Check for missing values
    missing_values = X_train['Region'].isnull().sum()
    if missing_values > 0:
        print(f"Warning: {missing_values} missing values in 'Region' column.")
        # Handle missing values (e.g., imputation or dropping rows)

    # Proceed with label encoding
    le = LabelEncoder()
    X_train['Region'] = le.fit_transform(X_train['Region'])  # Assuming 'Region' is the categorical column
else:
    print("Error: 'Region' column not found in X_train data.")
    # Handle the missing column (e.g., investigate data source)


In [None]:
import pandas as pd

# Check if X_train is a DataFrame (optional)
if not isinstance(X_train, pd.DataFrame):
    X_train = pd.DataFrame(X_train)  # Convert Series to DataFrame (if needed)

# Assuming you have region data in a separate list or variable called 'region_data'
X_train['Region'] = region_data  # Add the new column

# Now X_train will have the 'Region' column with your region data


In [None]:
X_train['Region'] = ['Unknown'] * len(X_train)  # Fill with 'Unknown' for all rows


In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
import pandas as pd

# Assuming your data is in a DataFrame called 'df'

# Iterate over the columns of the dataframe
for col in df.columns:
  # Check if the data type of the column is float
  if pd.api.types.is_float_dtype(df[col]):
    # Convert the float column to its string representation
    df[col] = df[col].astype(str)
  else:
    # Leave non-float columns unchanged
    pass

# Now 'df' will have all float columns converted to strings


In [None]:
def convert_to_string(col):
  if pd.api.types.is_float_dtype(col):
    return col.astype(str)
  else:
    return col

df = df.apply(convert_to_string)


### Training the Model

Before training, the categorical `region` feature is label-encoded so that the classifier receives numeric input.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder
le = LabelEncoder()

# Fit the encoder on the training data (assuming 'region' is the column containing categorical values)
X_train['region'] = le.fit_transform(X_train['region'])

# Now X_train['region'] will contain integer labels for each region (e.g., 0 for "Kagera")

# Repeat the encoding for the testing data (X_test) using the fitted encoder (le)
X_test['region'] = le.transform(X_test['region'])

In [None]:
from io import StringIO


In [None]:
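from sklearn import tree
import pydotplus
import matplotlib.image as mpimg

# Assumption: no fitted tree exists yet in this notebook, so fit one on the encoded region column
dtc = tree.DecisionTreeClassifier(random_state=0)
dtc.fit(X_train[['region']], y_train)

dot_data = StringIO()
filename = 'pumptree.png'
featureNames = ['region']
# class_names must be strings, listed in ascending order of the class labels
targetNames = [str(c) for c in np.unique(y_train)]
out = tree.export_graphviz(dtc, feature_names=featureNames, 
                           out_file=dot_data, 
                           class_names=targetNames, 
                           filled=True, 
                           special_characters=True, 
                           rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')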

### Geo-Visualization of the pumps

In [None]:
pip install geopandas

In [None]:
import matplotlib.pyplot as plt  # Import for figure size
import geopandas as gpd  # Import for geospatial data

# ... rest of your code ...

gdf = gpd.GeoDataFrame(df_1, geometry=gpd.points_from_xy(df_1.longitude, df_1.latitude))

In [None]:
import matplotlib.pyplot as plt  # Import matplotlib.pyplot for 'rcParams'
import geopandas

# Set the figure size
plt.rcParams['figure.figsize'] = 30, 20


# let's visualize the data
gdf = geopandas.GeoDataFrame(df_1, geometry=geopandas.points_from_xy(df_1.longitude, df_1.latitude))

# LabelEncoder codes are alphabetical: 0 = functional, 1 = functional needs repair, 2 = non functional
functional = gdf[gdf['status_group'] == 0]
repair = gdf[gdf['status_group'] == 1]
broken = gdf[gdf['status_group'] == 2]

# Note: geopandas.datasets was removed in geopandas 1.0, so this line needs an older release
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# We restrict to Africa
ax = world[world.continent == 'Africa'].plot(
    color='gray', edgecolor='black')

ax.scatter(functional['longitude'], functional['latitude'],
           c='green',alpha=.5, s=3)

ax.scatter(repair['longitude'], repair['latitude'],
           c='blue', alpha=.5, s=5)

ax.scatter(broken['longitude'], broken['latitude'],
           c='red', alpha=.5, s=5)
plt.title("Map of Pump Distributions, Green-Functional, Blue-Repair, Red-Broken", fontsize = 25)

plt.ylim(-12, 0)
plt.xlim(28,41)

plt.show()


In [None]:
df.columns

In [None]:
print(df['date_recorded'].dtypes)


In [None]:
pip install --upgrade pandas


In [None]:
print(df['date_recorded'].dtypes)

In [None]:
df["date_recorded"] = pd.to_datetime(df['date_recorded'])

### Normalize the Data

In [None]:
# X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))

In [None]:
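# let's import the decision tree module
from sklearn import tree

# X must be a numeric feature matrix; it is not defined earlier in this notebook
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle = True, random_state=0)

model = tree.DecisionTreeClassifier()
model = model.fit(X_train, y_train)

predicted_value = model.predict(X_test)
print(predicted_value)
#%%
tree.plot_tree(model)

# Count the two classes in the training labels (positional arrays avoid index-label lookups)
y_train_values = np.asarray(y_train)
zeroes = int((y_train_values == 0).sum())
ones = len(y_train_values) - zeroes

print(zeroes)
print(ones)

# Gini impurity of the training labels: 1 minus the sum of squared class proportions
n = len(y_train_values)
val = 1 - ((zeroes/n)**2 + (ones/n)**2)
print("Gini :-",val)

# Accuracy: share of all test predictions that match the true labels
match = int((predicted_value == np.asarray(y_test)).sum())
UnMatch = len(predicted_value) - match

accuracy = match/len(predicted_value)
print("Accuracy is: ",accuracy)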
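
For reference, the Gini impurity of a set of labels is 1 minus the sum of the squared class proportions. It is 0 for a pure node and 0.5 for an even two-class split. A small self-contained sketch (the helper name `gini_impurity` is made up for this illustration) that works for any number of classes:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity([0, 0, 1, 1]))  # 0.5 for a 50/50 binary split
print(gini_impurity([1, 1, 1, 1]))  # 0.0 for a pure node
```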

## XGBoost Classifier
Base Model

In [None]:
# train test split
from xgboost import XGBClassifier

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# training XGBoost on the training set
classifier = XGBClassifier()
classifier.fit(x_train, y_train)

In [None]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
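
To make the confusion matrix easier to read, the sketch below works through a small made-up example with three classes. Rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions. The labels and predictions are illustrative, not model output.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true and predicted labels for three classes (0, 1, 2)
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_hat = [0, 0, 0, 2, 0, 1, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])
print(cm_toy)

# Accuracy is the diagonal (correct predictions) divided by the total
acc_toy = accuracy_score(y_true, y_hat)
print(acc_toy)

# Per-class precision and recall expose weak minority-class performance
print(classification_report(y_true, y_hat, zero_division=0))
```

Accuracy alone can hide poor performance on the smallest class, which is why the per-class report is included.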

## Cross Validation
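
No cross-validation code appears under this heading. As a minimal sketch of what it could look like, the block below runs stratified 5-fold cross-validation with `cross_val_score` on synthetic three-class data from `make_classification`. The data, the random forest settings, and the fold count are assumptions chosen for illustration rather than the notebook's own configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for the encoded pump features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=5,
                                   n_classes=3, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; the mean and spread summarize generalization
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```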