<a id="problema"></a>
# <font color=green>Problem statement</font>

For every aspiring entrepreneur, starting a new business presents many challenges. One of the most daunting is uncertain access to the information needed to make informed decisions and to mitigate the risks of investing time and money.

In light of this, tools that provide organized, accurate, and reliable information would be of great benefit to entrepreneurs, giving them more security and confidence when assessing the profitability of their business ideas.

**Data science** is therefore well suited to building such support tools for entrepreneurs and their new businesses.

<a id="preguntas"></a>
# <font color=green>Asking Questions </font>

Based on the problem described above, the entrepreneur and the team raised the following questions:

1. What is the best location in Guadalajara, Mexico to open my bicycle business?
2. How many sales will I be able to obtain in the first months of starting my business?
3. What are the products most in demand by potential customers?
4. What prices will be the most competitive for bicycles?
5. How often will a customer want to buy clothing or accessories for their bikes?
6. How many customers will come to my business for a repair or upgrade on their bicycle?

While we cannot guarantee that every question raised will be answered, we assured the entrepreneur that we would analyze the available data and then evaluate which information can be presented.

# Data Collection
We chose the "Bike Buyers 1000" and "Bike Sales" datasets, both found on Kaggle, because they contain information relevant to the problem and complement each other.

"Bike Buyers 1000": https://www.kaggle.com/datasets/heeraldedhia/bike-buyers

"Bike Sales": https://www.kaggle.com/datasets/liyingiris90/bike-sales

## Process to obtain the dataset

Since the files in the Bike Sales dataset are in .xlsx format, they are read with the pandas function <font color =red> *pd.read_excel* </font>.

Bike Buyers is a .csv file, so the familiar <font color =red> *pd.read_csv*</font> function is used.
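
As a minimal sketch of the idea above, the reader can be chosen from the file extension. The helper name `load_table` and the toy CSV content are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_table(path):
    """Read a tabular file with the pandas reader that matches its extension.

    Hypothetical helper, not part of the original notebook.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".xlsx":
        # Reading .xlsx files relies on the openpyxl package being installed.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")


# Self-contained demonstration on a throwaway CSV file (toy data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("ID,Age\n1,42\n2,35\n")
    demo = load_table(demo_path)

print(demo.shape)
```

Raising an error for unknown extensions makes an unexpected file type visible immediately instead of producing a badly parsed table.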

In [None]:
%pip install openpyxl

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from colorama import Fore
from colorama import Style
import matplotlib.pyplot as plt
sns.set()
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.cluster import KMeans

The datasets are loaded into their respective DataFrames, <font color=blue> "bike sales"</font> and <font color=blue> "bike buyers"</font>.

In [None]:
buyers = pd.read_csv("datasets/bike_buyers.csv")

In [None]:
bikes = pd.read_excel("datasets/bikes.xlsx")
bikestores = pd.read_excel("datasets/bikeshops.xlsx")
orders = pd.read_excel("datasets/orders.xlsx")

# Dataset Exploration
We review each DataFrame using <font color=blue> .head() </font> and <font color=blue>.tail()  </font>, as well as <font color=blue> .dtypes, .columns, .shape, .loc[$n:m$] </font> (where $n, m \in \mathbb{N}$ and $n < m$).
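
By default a notebook cell displays only the value of its last expression, so when several checks share one cell it can help to print each of them explicitly. The sketch below bundles the checks listed above into one function; the name `quick_look` and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd


def quick_look(df, start, stop):
    """Print the basic structural checks for one DataFrame.

    Hypothetical helper; `start` and `stop` are index labels for .loc,
    which includes both endpoints.
    """
    print("shape:", df.shape)
    print("columns:", list(df.columns))
    print(df.dtypes)
    print(df.head())
    print(df.tail())
    print(df.loc[start:stop])


# Toy data standing in for one of the real tables.
toy = pd.DataFrame({"ID": range(10), "Income": [i * 1000.0 for i in range(10)]})
quick_look(toy, 3, 5)

# .loc slices by label and is inclusive, so labels 3, 4 and 5 are returned.
middle = toy.loc[3:5]
```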

In [None]:
buyers.dtypes # There are floats that should be integers.
buyers.shape # There are 1000 entries and 13 columns.
buyers.columns # ID, Marital S, Gender, Income, Children, Education, Occupation, Home Owner, Cars, Commute Distance, Region, Age, purchase bike
buyers.head() # NaNs are observed
buyers.tail() # NaNs are observed
c=list(buyers.columns) # The column names are saved in a list, which is used later to rename the columns of the DataFrame.
buyers.loc[500:515] # NaNs are observed. 

In [None]:
bikes.dtypes # The only dtype that looks questionable is price, which is stored as int.
bikes.shape # There are a total of 97 rows with 6 columns
bikes.columns # bike.id, model, category1, category2, frame, price. Columns will be renamed.
bikes.head() # NaNs are not observed.
bikes.tail() # NaNs are not observed.
bc=list(bikes.columns)
bikes.loc[46:58] # No NaNs are observed in this slice of rows.

In [None]:
bikestores.dtypes # All data has the correct type.
bikestores.shape # There are a total of 30 rows with 6 columns
bikestores.columns # bikeshop.id , bikeshop.name, bikeshop.city, bikeshop.state, latitude , longitude. Columns will be renamed.
bikestores.head() # NaNs are not observed.
bikestores.tail() # NaNs are not observed.
bs=list(bikestores.columns)
bikestores.loc[10:25] # No NaNs are observed in this slice of rows.

In [None]:
orders.dtypes # order.id, order.line, customer.id and product.id are of type float when they should be integers.
orders.shape # There are a total of 15644 rows and 7 columns. 
orders.columns # 'Unnamed: 0', 'order.id', 'order.line', 'order.date', 'customer.id', 'product.id' and 'quantity'. Columns will be renamed.
oc=list(orders.columns) 
orders.head() # No NaNs are observed; however, the 'Unnamed: 0' column appears to duplicate the row index.
orders.tail() # NaNs are not observed. 
orders.loc[10468:10480] # NaNs are not observed.
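
Several columns were noted above as floats that should be integers. One possible way to convert them, sketched here on made-up toy data rather than the real tables, is the pandas nullable integer dtype `"Int64"`, which tolerates missing values. Whether the notebook's later steps use this approach is not shown in this section.

```python
import numpy as np
import pandas as pd

# Toy data: ID-like columns stored as floats, one of them with a missing value.
toy = pd.DataFrame({
    "order.id": [1.0, 2.0, 3.0],
    "Children": [0.0, np.nan, 2.0],
})

# "Int64" (capital I) is the pandas nullable integer dtype: missing values
# become <NA>, whereas a plain astype(int) would raise an error on NaN.
converted = toy.astype({"order.id": "Int64", "Children": "Int64"})

print(converted.dtypes)
```

If the NaNs are removed first, a plain `astype(int)` would also work on the cleaned column.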

We inspect the unique values of selected columns in some of the datasets.

In [None]:
orders['product.id'].unique()

In [None]:
orders['customer.id'].unique()

In [None]:
bikes['bike.id'].unique()

In [None]:
bikestores['bikeshop.id'].unique()

# Renaming the columns in the DataFrames

Dictionaries are created to rename the columns of our DataFrames.
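
The dictionaries that follow are written out entry by entry. As an alternative sketch, an equivalent mapping could be built with `zip`, with a length check to catch a mismatch between old and new names. The toy DataFrame and the three-name list are assumptions made for illustration.

```python
import pandas as pd

# Toy frame carrying three of the original bikeshop column names.
toy = pd.DataFrame(columns=["bikeshop.id", "bikeshop.name", "bikeshop.city"])
new_names = ["id_store", "store_name", "city"]

# Guard against a silent mismatch between old and new names.
if len(new_names) != len(toy.columns):
    raise ValueError("new_names must have one entry per column")

# Pair each existing column with its new name, in column order.
mapping = dict(zip(toy.columns, new_names))
renamed = toy.rename(columns=mapping)

print(list(renamed.columns))
```

Writing the dictionary by hand, as the notebook does, has the advantage of making it easy to skip a column such as 'Unnamed: 0'.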

In [17]:
buy_names = {c[0]:'id_buyer',
              c[1]:'civil_status',
              c[2]:'gender',
              c[3]:'salary',
              c[4]:'children',
              c[5]:'education',
              c[6]:"profession",
              c[7]:"own_house",
              c[8]:'cars',
              c[9]:'trip_distance',
              c[10]:'region',
              c[11]:'age',
              c[12]:'purchased_bicycle'}

bike_names = {bc[0]:'id_bicycle',
              bc[1]:'model',
              bc[2]:'category_1',
              bc[3]:'category_2',
              bc[4]:'alloy',
              bc[5]:'price'}

stores_names = {bs[0]:'id_store',
                  bs[1]:'store_name',
                  bs[2]:'city',
                  bs[3]:'state',
                  bs[4]:'latitude',
                  bs[5]:'length'}  # note: bs[5] is the longitude column

orders_names = {oc[1]:'id_order',
                  oc[2]:'order_line',
                  oc[3]:'order_date',
                  oc[4]:'id_store',
                  oc[5]:'id_bicycle',
                  oc[6]:'items_number'}

Note that the customer.id column of the orders table actually holds the store ID, since it has the same number of distinct entries as bikeshop.id and the same values. It is therefore renamed to id_store.
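
A check of this kind can be sketched with Python sets. The two small DataFrames below are made-up stand-ins for the orders and stores tables, so the printed result describes only the toy data.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up).
toy_orders = pd.DataFrame({"customer.id": [2, 1, 3, 2, 1]})
toy_stores = pd.DataFrame({"bikeshop.id": [1, 2, 3]})

order_ids = set(toy_orders["customer.id"].dropna().unique())
store_ids = set(toy_stores["bikeshop.id"].unique())

same_count = len(order_ids) == len(store_ids)   # same number of distinct IDs
same_values = order_ids == store_ids            # exactly the same IDs
unmatched = order_ids - store_ids               # IDs in orders with no store

print(same_count, same_values, unmatched)
```

An empty `unmatched` set means every ID used in the orders also exists in the stores table.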

In [18]:
buyers = buyers.rename(columns=buy_names)
orders = orders.rename(columns=orders_names)
bikes = bikes.rename(columns=bike_names)
stores =  bikestores.rename(columns=stores_names)

In [19]:
buyers.name = 'buyers'
orders.name = 'orders'
bikes.name = 'bikes'
stores.name = 'stores'

# Removal of NaNs

To obtain cleaner DataFrames, the NaNs are removed.

In [20]:
# The following functions report the number and percentage of NaNs in each DataFrame.
def nans_numbers(dataframe):
    print(dataframe.isna().sum())

def nans_percentage(dataframe):
    print(dataframe.name)
    print(dataframe.isna().sum()/len(dataframe)*100)
    print("\n")

In [None]:
nans_percentage(orders)
nans_percentage(stores)
nans_percentage(bikes)
nans_percentage(buyers) # The only DataFrame with NaNs is buyers.
nans_numbers(buyers)
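
As a variation on the functions above, the summary can be returned as a single Series instead of being printed, with the DataFrame labels supplied through a dictionary. Custom attributes such as `.name` set on a DataFrame are generally not carried over to the results of later operations, which is why the labels are passed explicitly here. The helper name `nan_summary` and the toy data are assumptions made for illustration, and dropping rows is only one possible clean-up strategy.

```python
import numpy as np
import pandas as pd


def nan_summary(frames):
    """Return the percentage of missing values per column for each frame.

    `frames` maps a label to a DataFrame; hypothetical helper.
    """
    parts = {label: df.isna().mean() * 100 for label, df in frames.items()}
    # Concatenating a dict of Series yields a (label, column) MultiIndex.
    return pd.concat(parts).rename("nan_percent")


# Toy data (made up) with one missing value.
toy_frames = {
    "buyers": pd.DataFrame({"age": [30.0, np.nan, 45.0, 50.0], "cars": [1, 0, 2, 1]}),
    "stores": pd.DataFrame({"city": ["A", "B"]}),
}
summary = nan_summary(toy_frames)
print(summary)

# One possible clean-up: drop every row that contains at least one NaN.
clean_buyers = toy_frames["buyers"].dropna()
```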