# Logging into Kaggle for the first time can be daunting.
Our competitions often have large cash prizes, public leaderboards, and involve complex data. Nevertheless, we really think all data scientists can rapidly learn from machine learning competitions and meaningfully contribute to our community. To give you a clear understanding of how our platform works and a mental model of the type of learning you could do on Kaggle, we've created a Getting Started tutorial for the Titanic competition. It walks you through the initial steps required to get your first decent submission on the leaderboard. By the end of the tutorial, you'll also have a solid understanding of how to use Kaggle's online coding environment, where you'll have trained your own machine learning model.

![](https://www.dataquest.io/wp-content/uploads/2017/12/kaggle-fundamentals.jpg)

## Import the required libraries:
- `pandas` - used to load the data into a `pd.DataFrame` for more convenient handling
- `numpy` - a library for efficient work with arrays and mathematical functions
- `os` - a standard-library module for selecting the folder where the data is stored
- `glob` - a standard-library module for selecting files by pattern, for example all files with a certain extension

In [None]:
import os
import glob
os.chdir('/kaggle/input/c/titanic/')

all_file_names = [i for i in glob.glob('*.csv')]
all_file_names

In [None]:
import pandas as pd
import numpy as np

# change the dataframe display setting so that all columns are displayed
pd.options.display.max_columns = None

In [None]:
df = pd.read_csv('/kaggle/input/c/titanic/train.csv')
df

In [None]:
df_test = pd.read_csv('/kaggle/input/c/titanic/test.csv')

We do not use the first column, because it duplicates the ID that the DataFrame index already provides.

In [None]:
df.drop(df.columns[0], axis=1, inplace=True)

In [None]:
df['Name'].value_counts()

The names contain title prefixes such as Mrs., Capt., and so on. Let's see whether they affect the model.
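As an alternative way to look at the titles, here is a minimal sketch that pulls the title out of each name with a regular expression. It assumes the names follow the "Surname, Title. Given names" layout; the sample names and the `titles` variable are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative sample in the "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Crosby, Capt. Edward Gifford",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A single `Title` column like this can then be one-hot encoded, which avoids creating one dummy column per full name.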

In [None]:
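df_dum = pd.get_dummies(df['Name'])  # assumed definition: df_dum is never created elsewhere in this notebook
df = pd.concat([df.drop('Name', axis=1), df_dum], axis=1)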

In [None]:
# Trigger words used to group the one-hot encoded name columns
# Each list holds the title that the grouped column will represent
searchMr = ['Mr' ]
searchDr = ['Dr' ]
searchMrs = ['Mrs']
searchMiss = ['Miss']
searchMadam = ['Madam']



def search_col(df_dum, search):
    """Find the columns whose names contain any of the trigger words.

    Takes a DataFrame and a list of trigger words; returns the matching column names."""
    matches=[]
    for col in df_dum.columns:
        for match in search:
            if match in col:
                matches.append(col)
    return matches


def boolen(num):
    """Convert a number to bool: return False for 0 and True for any other value."""
    if num==0:
        return False
    return True



def conc(Name ,df, matches):
    """A function to group columns and convert data to bool using the boolen function.
     the input of the function is the desired name of the grouped columns (Name), DataFrame with the required columns
      and the columns needed for grouping. The output is a DataFrame without columns that needed to be grouped, but
      with a new column - grouping"""
    df[Name] = False
    for match in matches:
        df[Name] = df[Name]+df[match]
    df[Name] = df[Name].apply(boolen)
    return df.drop(matches, axis=1)
    

In [None]:
searchWinter = ['cyt;', "ktl"]
SearchVision = ['dbl','hfpk']
SearchSign = ['ghbv','ncen']

wint = search_col(df, searchWinter)

sign = search_col(df, SearchSign)

vis = search_col(df, SearchVision)

df = conc('BadWinter', df, wint)
df = conc('BadSign', df, sign)
df = conc('BadVision', df, vis)

In [None]:
df = conc('Miss', df, search_col(df, searchMiss))
df = conc('Capt', df, search_col(df, ['Capt']))
df = conc('Col', df, search_col(df, ['Col']))
df = conc('Dr', df, search_col(df, ['Dr']))
df = conc('Master', df, search_col(df, ['Master']))

This method can be used to group data, but it does not scale when there are a lot of columns.

We now have a lot of features, and many of them are used extremely rarely, so we need to get rid of them.
We will drop the features obtained by binarizing text features,
since such sparse values hurt the generalization ability of the future model.
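The idea of dropping rarely used binary features can be sketched on a small hand-made frame. The `demo` data and the 20% threshold below are assumptions chosen for illustration, not values from the Titanic data.

```python
import pandas as pd

# Illustrative frame: two binary features and one numeric feature
demo = pd.DataFrame({
    "Mr": [True, True, False, True, False, True, True, False, True, True],
    "Capt": [False] * 9 + [True],
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
})

# A binary feature is "rare" if it is set in fewer than 20% of the rows
threshold = 0.2 * len(demo)
bool_cols = demo.select_dtypes(include="bool").columns
rare = [col for col in bool_cols if demo[col].sum() < threshold]

demo = demo.drop(columns=rare)
print(rare, list(demo.columns))
```

Collecting the rare columns first and dropping them in one call avoids modifying the frame while looping over its columns.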

In [None]:
df.drop(['Cabin'], axis=1, inplace=True)

In [None]:
df.dropna(inplace=True)

In [None]:
df.info()

## Visualizing the impact of attributes
Our visualization will consist of:
- Correlation matrix
- Histograms of feature influence
- Box plots (box-and-whisker plots)

In [None]:
# seaborn for the correlation matrix
import seaborn as sns
# plotly for box plots and histograms of feature influence
import plotly.express as px
# pyplot to increase the size of the plot
import matplotlib.pyplot as plt

from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)  

In [None]:
# check each binary column for usage> 10%
proc = len(df)*0.1
for col in df.columns[11:]:
    if df[col].sum()<proc:
        df.drop(col, inplace=True, axis=1)

In [None]:
# increase the size
plt.figure(figsize=(20,20))

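# build the correlation matrix (numeric and boolean columns only, since text columns cannot be correlated)
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), annot=True)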


In [None]:
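plt.figure(figsize=(8, 12))
# correlation of every numeric or boolean feature with the target
corr = df.select_dtypes(include=['number', 'bool']).corr()
heatmap = sns.heatmap(corr[['Survived']].sort_values(by='Survived', ascending=False), annot=True, cmap='BrBG')

heatmap.set_title('Features Correlating with Survived', fontdict={'fontsize':18}, pad=16)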

With this visualization, we can see that some features are strongly correlated with `Survived`.

In [None]:
for i in ['Pclass','Age','SibSp']:
    print(str(i)+' distribution over the data')
    px.box(y=df[i].astype('float'), x=df['Survived']).show()

These plots are hard to interpret on their own, so it will be interesting to try clustering instead of using the target itself.

In [None]:
df.Survived.value_counts()

In [None]:
df.drop('Survived', axis=1, inplace=True)

### Using metrics
For unsupervised learning like this, it is better to use the `silhouette_score` metric. The silhouette is a method of interpreting and validating consistency within clusters of data. The technique provides a concise graphical representation of how well each object has been clustered. The silhouette value measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, and higher values indicate better-separated clusters.
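To make the definition concrete, here is a small sketch that computes the silhouette by hand and compares it with scikit-learn. For each point, `a` is the mean distance to the other points of its own cluster, `b` is the mean distance to the nearest other cluster, and the point's value is `(b - a) / max(a, b)`. The toy data and the `silhouette_manual` helper are illustrative, and the helper assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy data: two well-separated groups on a line (illustrative values)
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_manual(X, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    scores = []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)  # distances from point i to all points
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = dist[same].mean()
        b = min(dist[labels == k].mean() for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


print(silhouette_manual(X_toy, labels))
print(silhouette_score(X_toy, labels))
```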

In [None]:
from sklearn import metrics

In [None]:
df.drop('isWeekday', axis=1, inplace=True)
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].apply(lambda x: x.month)
df['hour'] = df['date'].apply(lambda x: x.hour)
df['day'] = df['date'].apply(lambda x: x.day)
df['day_of_the_week'] = df['date'].apply(lambda x: x.isoweekday())


import seaborn as sns
corre = df.iloc[:, 3:].corr()
sns.heatmap(corre, annot=True)



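In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# TF-IDF features over word 1- to 3-grams
tfidf = TfidfVectorizer(ngram_range=(1,3))
tfidf_features = tfidf.fit_transform(df.text)
tfidf_feature_names = tfidf.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0

# raw count features over the same n-gram range
cv = CountVectorizer(ngram_range=(1,3))
cv_features = cv.fit_transform(df.text)
cv_feature_names = cv.get_feature_names_out()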

In [None]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
n_clusters = [12,13,14,15,16,17,18,19,20] # number of clusters
clusters_inertia = [] # inertia of clusters
s_scores = [] # silhouette scores

for n in n_clusters:
    KM_est = KMeans(n_clusters=n, init='k-means++').fit(features)
    clusters_inertia.append(KM_est.inertia_)    # data for the elbow method
    silhouette_avg = silhouette_score(features, KM_est.labels_)
    s_scores.append(silhouette_avg) # data for the silhouette score method
    
In [None]:
fig, ax = plt.subplots(figsize=(12,5))
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax = sns.lineplot(x=n_clusters, y=s_scores, marker='o', ax=ax)
ax.set_title("Silhouette score method")
ax.set_xlabel("number of clusters")
ax.set_ylabel("Silhouette score")
ax.axvline(17, ls="--", c="red")
plt.grid()
plt.show()

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation
import gensim
from gensim import corpora
import pyLDAvis.gensim


no_topics = 15
 
#NMF
nmf_tfidf = NMF(n_components=10, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf_features)


#LDA
lda_cv = LatentDirichletAllocation(n_components=10, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(cv_features)


In [None]:
def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted terms of every topic in a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        top_terms = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        print("\ntopic {}: {}".format(topic_idx, "|".join(top_terms)))

no_top_words = 10
print('--------------- NMF topics (tfidf_features) ---------------')
display_topics(nmf_tfidf, tfidf_feature_names, no_top_words)
print()
print('--------------- LDA topics (CountVectorizer features) ---------------')
display_topics(lda_cv, cv_feature_names, no_top_words)

In [None]:
# gensim text clustering

text_data = df.text.apply(lambda x: x.split())
# Filter out single-character words
text_data = text_data.apply(lambda x: [w for w in x if len(w) > 1])

dictionary = corpora.Dictionary(text_data)

# Filter out words that occur in fewer than 5 documents or in more than 90% of documents
dictionary.filter_extremes(no_below=5, no_above=0.9)

In [None]:
# Bag-of-words (count) corpus
corpus = [dictionary.doc2bow(text) for text in text_data]

# Plain LDA model
import gensim
no_topics = 9
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=no_topics, id2word=dictionary)

topics = ldamodel.print_topics(num_words=8)
for topic in topics:
    print("topic %d:" % topic[0])
    print(topic[1])
    print()
    
    
    
In [None]:
# Below we use a TF-IDF-based corpus for topic modeling, again with a fixed number of topics.
# Here we use the parallel multicore LDA model (LdaMulticore). If your CPU has several cores,
# training the model this way can shorten the training time.

# TF-IDF corpus
tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# Multicore parallel LDA model
no_topics = 9
tf_idf_lda_model = gensim.models.LdaMulticore(corpus_tfidf, num_topics=no_topics, id2word=dictionary, passes=2, workers=4)

topics = tf_idf_lda_model.print_topics(num_words=8)
for topic in topics:
    print("topic %d:" % topic[0])
    print(topic[1])
    print()


In [None]:
lda_display = pyLDAvis.gensim.prepare(tf_idf_lda_model, corpus_tfidf, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

### Choosing a Clustering Model
There will be 3 clustering models under consideration:
- k-means
- EM algorithm (Gaussian mixture)
- BIRCH

The clustering model will be selected based on its `silhouette_score`: the higher, the better.
The metric can be computed on 1/10 of the data, because calculating it on all of the data takes a lot of time.
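One way to score only a tenth of the data is the `sample_size` argument of `silhouette_score`, which evaluates the metric on a random subset. The sketch below uses synthetic blobs as a stand-in for the real features; `X_demo` and `labels_demo` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the real features: two Gaussian blobs
X_demo = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(8, 1, (500, 2))])
labels_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# Silhouette on all points versus on a random 10% sample
score_full = silhouette_score(X_demo, labels_demo)
score_sample = silhouette_score(
    X_demo, labels_demo, sample_size=len(X_demo) // 10, random_state=0
)
print(score_full, score_sample)
```

The sampled value is an estimate, so it is worth fixing `random_state` when comparing several models on the same subset.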

In [None]:
# errors='ignore' skips columns that were already dropped in earlier cells
X = df.drop(columns=["Survived", 'Name', 'Sex', 'Ticket'], errors='ignore')

In [None]:
X.drop('Embarked', axis=1, inplace=True)

In [None]:
X.dropna(inplace=True)

In [None]:
X

### k-means method
The most popular data clustering algorithm is the k-means method. It is an iterative clustering algorithm that minimizes the total squared deviation of the points in each cluster from that cluster's centroid (the mean of its coordinates).
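The quantity that k-means minimizes can be checked by hand. This sketch fits `KMeans` on a few illustrative points and recomputes the total squared deviation from the centroids, which scikit-learn exposes as `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative points forming two obvious groups
X_toy = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Centroid assigned to each point, then the sum of squared deviations
centroids = km.cluster_centers_[km.labels_]
inertia_manual = ((X_toy - centroids) ** 2).sum()

print(inertia_manual, km.inertia_)
```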

In [None]:
### k-means method
from sklearn.cluster import KMeans

# initialize the model
model_kmeans = KMeans(n_clusters=2,random_state=0)
# train our model
model_kmeans.fit(X)
# make predictions
pred_kmeans_test=model_kmeans.predict(X)

### EM algorithm
EM (expectation-maximization) fitting of a Gaussian mixture model is another type of clustering algorithm. As the name of the model suggests, each cluster is modeled by its own Gaussian distribution. This flexible and probabilistic approach to data modeling means that instead of hard assignments to clusters like k-means, we have soft assignments. This means that each data point could have been generated by any of the distributions with the appropriate probability.
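The difference between soft and hard assignments can be seen with `predict_proba`. This is a minimal sketch on synthetic one-dimensional data; `X_demo` is an illustrative stand-in for the real features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 1-D Gaussians (synthetic data)
X_demo = np.vstack([rng.normal(0, 1, (100, 1)), rng.normal(4, 1, (100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_demo)

proba = gmm.predict_proba(X_demo)  # soft assignment: one probability per component
hard = gmm.predict(X_demo)         # hard assignment: the most probable component

print(proba[:3].round(3))
print(hard[:3])
```

Points that lie between the two distributions get probabilities close to 0.5 for each component, which a hard assignment would hide.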

In [None]:
# EM algorithm
from sklearn.mixture import GaussianMixture

# initialize the model 
model_gaus = GaussianMixture(n_components=2)
# train our model
model_gaus.fit(X)
# make predictions
pred_model_gaus_test=model_gaus.predict(X)

### Birch clustering
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering on large datasets. Its advantage is the ability to cluster incrementally and dynamically as multidimensional data points arrive, in an attempt to obtain the best clustering for the available resources.
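The incremental behaviour can be sketched with `partial_fit`, which updates the model one batch at a time. The synthetic blobs, the batch count, and the `threshold` value below are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs, shuffled so every batch contains both
X_demo = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(5, 0.5, (200, 2))])
rng.shuffle(X_demo)  # shuffles the rows in place

birch = Birch(n_clusters=2, threshold=0.5)
# Feed the data in four batches, as if it arrived over time
for batch in np.array_split(X_demo, 4):
    birch.partial_fit(batch)

labels = birch.predict(X_demo)
print(np.bincount(labels))
```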

In [None]:
# Birch clustering
from sklearn.cluster import Birch

# initialize the model 
model_db = Birch(n_clusters=2)
# train our model
model_db.fit(X)
# make predictions
pred_model_db_test=model_db.predict(X)

In [None]:
# See the result of the metric assessment
metrics.silhouette_score(X,pred_kmeans_test)

In [None]:
# See the result of the metric assessment
metrics.silhouette_score(X, pred_model_gaus_test)

In [None]:
# See the result of the metric assessment
metrics.silhouette_score(X, pred_model_db_test)

In [None]:
X['y'] = pred_kmeans_test

In [None]:
X.y.value_counts()

In [None]:
proc = len(df)*0.05
for col in df.iloc[:,8:].drop('y', axis=1).columns:
        if df[col].sum()< proc:
            df.drop(col, axis=1, inplace=True)

In [None]:
dic = {2:'red', 1:'yellow', 0:'green'}

df['color'] = df.y.map(dic)

fig = px.scatter_mapbox(df,
                        lat='klm',
                        lon='longisland',
                        hover_data=['Harmony', 'count'],
                        color_discrete_sequence=[df.color])

fig.update_layout(mapbox_style='open-street-map')
fig.show()

In [None]:
for i in range(2):
    print('\n')
    print('Mean age of group '+str(i)+':\n'+str(np.mean(X[X['y']==i]['Age'])))

In [None]:
plt.figure(figsize=(8, 12))
# correlate the features with the cluster label y (Survived was dropped from df above)
heatmap = sns.heatmap(X.corr()[['y']].sort_values(by='y', ascending=False), annot=True, cmap='BrBG')

heatmap.set_title('Features Correlating with y', fontdict={'fontsize':18}, pad=16)

## Classification methods
In testing models, 3 classification methods will be adopted:
### AdaBoost Algorithm
- Good generalization ability. In real problems (not always, but often) it is possible to build compositions that are superior in quality to the base algorithms. The generalization ability can improve (in some problems) as the number of base algorithms increases.
- Ease of implementation.
- The overhead of boosting itself is low. The time to build the composition is almost completely determined by the training time of the base algorithms.
- Ability to identify objects that are noise outliers.
### Decision tree algorithm
A decision tree is a useful technique for dividing a complex problem into smaller, more manageable subtasks. Solving a problem with a decision tree is carried out in two stages. The first stage includes building a decision tree indicating all possible outcomes (financial results) and their probabilities.
### Gradient Boosting Classifier
Gradient boosting is an ensemble of decision trees. The algorithm trains decision trees iteratively in order to minimize the loss function. Thanks to the properties of decision trees, gradient boosting is able to work with categorical features and cope with nonlinearities.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import sklearn
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
# import the helper for splitting the data
from sklearn.model_selection import train_test_split

Next we train each model and calculate its classification score (by calling `score`, which gives the share of correctly predicted classes).
Then we choose the best model based on this quality metric.


In [None]:
X_test_train, X_test_test, y_test_train, y_test_test = train_test_split(X.drop('y', axis=1), X.y)


In [None]:
# Decision tree algorithm
tree = DecisionTreeClassifier(random_state=0, max_depth=100)
# train the model
tree.fit(X_test_train, y_test_train)

# AdaBoost algorithm
ada = AdaBoostClassifier()
# train the model
ada.fit(X_test_train, y_test_train)

# Gradient Boosting Classifier
gb = GradientBoostingClassifier()
# train the model
gb.fit(X_test_train, y_test_train)

Checking the share of correctly predicted classes
The `model.score()` method compares the answers given by the classifier with the correct answers.
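For scikit-learn classifiers, `score` returns the mean accuracy on the data it is given. The sketch below checks this against `accuracy_score` on a synthetic dataset; the data and variable names are illustrative and separate from the notebook's own split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

acc_score = clf.score(X_te, y_te)                      # built-in score
acc_manual = accuracy_score(y_te, clf.predict(X_te))   # same value computed explicitly

print(acc_score, acc_manual)
```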

In [None]:
tree.score(X_test_test,y_test_test)

In [None]:
ada.score(X_test_test,y_test_test)

In [None]:
gb.score(X_test_test,y_test_test)

What I wanted to say with this work is that machine learning is easy to start with and hard to master. It is creative work that demands a lot of effort: it takes imagination to come up with not only the best model, but also the best solution around it.

![](http://quotefancy.com/media/wallpaper/3840x2160/20391-Nolan-Bushnell-Quote-A-good-game-is-easy-to-learn-but-hard-to.jpg)

In [None]:
import pickle
from datetime import datetime

import geocoder
import pandas as pd

a = 'Your answer:'


def str2bool(v):
    """Map a free-text answer to True/False; return None if it is not recognized."""
    v = v.lower()
    if any(word in v for word in ['yes', "aha", "of course", "always"]):
        return True
    if any(word in v for word in ['no', "nope", "never"]):
        return False
    return None


def take_inf():
    """Ask the user about an incident and return the answers as a one-row DataFrame."""
    b_a("I'll ask you to answer a couple of questions about the incident")
    b_a('Give the address, including the country and the city')
    g = geocoder.osm(input(a))
    if g.ok:
        region = g.region
        lat = g.lat
        lon = g.lng
    else:
        b_a('Error in geocoding, the location will have to be entered by hand')
        b_a('Write the name of the region in the nominative case')
        region = input(a)
        b_a('Write the longitude of the place where the accident occurred')
        lon = input(a)
        b_a('Write the latitude of the place where the accident occurred')
        lat = input(a)

    b_a('Is the date of the incident today?')
    answer = input(a)
    if str2bool(answer):
        date = datetime.today()
    else:
        b_a('Write the date in this format with hours and minutes, example: 18.08.21 18:00')
        date = datetime.strptime(input(a), '%d.%m.%y %H:%M')
    b_a('Please indicate the number of people injured in this accident')
    inj = int(input(a))
    b_a('Enter the number of deaths')
    death = int(input(a))
    b_a('In your opinion, how many accidents were there in this area')
    count = int(input(a))
    b_a('What are the most common accidents in this area? (Severe, Medium, Light)')
    severity = input(a).lower()
    # weight the accident count by severity: light x1, medium x2, severe x3
    if any(word in severity for word in ("light", "small", "little")):
        severity = count
    elif any(word in severity for word in ("average", "medium")):
        severity = 2 * count
    elif any(word in severity for word in ("heavy", "severe", "death", "dangerous")):
        severity = 3 * count
    b_a('Does this section have problems with the road surface in winter?')
    BadWinter = str2bool(input(a))
    b_a('Does this area have problems with poor lighting?')
    BadVision = str2bool(input(a))
    b_a('Does this section have problems with misapplied signs or other road traffic control tools?')
    BadSign = str2bool(input(a))

    b_a('Does this section have any problems with the road surface?')
    bad_road = str2bool(input(a))
    # the values are listed in the same order as the column names
    df = pd.DataFrame([[
        lat, lon, region, date, g.location, count, inj, death, severity,
        BadWinter, BadVision, BadSign, bad_road
    ]],
                      columns=[
                          'lat', 'lon', 'region', 'date', 'address', 'count',
                          'inj_count', 'death_count', 'severity', 'BadWinter',
                          'BadVision', 'BadSign', 'BadRoad'
                      ])
    return df


def readcsv_df(file, encoding=None, sep=None, csv=False):
    """Read the data from a DataFrame or, with csv=True, from a CSV file path.

    The encoding and separator are passed to pd.read_csv. Returns the DataFrame only if
    its columns match the expected set; otherwise prints a message and returns None."""
    if csv:
        df = pd.read_csv(file, encoding=encoding, sep=sep)
    else:
        df = file
    if set(df.columns) != set([
            'lat', 'lon', 'region', 'date', 'address', 'count',
            'inj_count', 'death_count', 'severity', 'BadWinter',
            'BadVision', 'BadSign', 'BadRoad'
    ]):
        print('Error: reformat the file and make sure the columns are correct')
        return None
    return df


def pred(df):
    '''Predict the threat level of a site.'''
    # keep only the model features (drop lat, lon, region, date, address)
    df = df.iloc[:, 5:]
    with open('model.sav', 'rb') as f:
        model = pickle.load(f)
    label = model.predict(df)[0]
    levels = {
        0: 'Light site threat',
        1: 'Medium site threat',
        2: 'Severe site threat',
    }
    return levels.get(label, 'Unknown threat level')


def count_y(gorod, y_level):
    '''Count the sites with a given threat level in a given city.'''
    df = pd.read_csv('result_with_y.csv')
    return df[df['y'] == y_level].groupby(['region']).count().loc[gorod]['y']


def b_a(text):
    '''Print a message with the "Bot:" prefix so it is not written out every time.'''
    print('Bot:' + str(text) + '\n')


def bot():
    n = 1
    while True:
        if n == 1:
            b_a("My commands: \n1. Find out the number of hazardous areas in a certain region "
                "\n2. Find out the hazard level of an area by its parameters "
                "\n3. Collect data on hazardous areas \nTo exit, write finish")
        b_a("What do you want?")
        answer = input(a).lower()

        if any(word in answer for word in ("how much", "i will pass")):
            b_a('Which areas are you interested in by the degree of danger?')
            level_answer = input(a).lower()
            y_level = None
            if any(word in level_answer for word in ("light", "small", "little")):
                y_level = 0
            elif any(word in level_answer for word in ("average", "medium")):
                y_level = 1
            elif any(word in level_answer for word in ("heavy", "death", "dangerous")):
                y_level = 2
            if y_level is None:
                b_a('I did not understand the degree of danger, please try again')
                continue
            b_a('Please write the name of the city in the nominative case')
            gorod = input(a)
            b_a(count_y(gorod, y_level))
            n = n + 1
        elif any(word in answer for word in ('find out', "what", "level")):
            b_a(pred(readcsv_df(take_inf())))
            n = n + 1
        elif any(word in answer for word in ("exit", "bye", 'finish')):
            b_a("Bye!")
            return 0
        elif any(word in answer for word in ("help", "info", 'commands')):
            n = 1

In [None]:
from datetime import datetime

import geocoder
import pandas as pd


def b(text):
    print('Bot:' + str(text))


a = 'You:'


def str2bool(v):
    """Map a free-text answer to True/False, asking again until the answer is recognized."""
    v = v.lower()
    if any(word in v for word in ['aha', "yes", "of course"]):
        return True
    if any(word in v for word in ['no', "nope", "never", "never was"]):
        return False
    b('I do not understand you, please repeat')
    # read a new answer instead of recursing on the same text forever
    return str2bool(input(a))


def take_inf(orig=False):
    """Collect incident data; with orig=True only the model features are asked for."""
    if not orig:
        g = geocoder.osm(input(a))
        if g.ok:
            region = g.region
            lat = g.lat
            lon = g.lng
        else:
            region = input(a)
            lat = input(a)
            lon = input(a)

        answer = input(a)
        if str2bool(answer):
            date = datetime.today()
        else:
            date = datetime.strptime(input(a), '%d.%m.%y %H:%M')

    inj = int(input(a))
    dead = int(input(a))
    count = int(input(a))
    # weight the accident count by severity, asking again until the answer is recognized
    severity = None
    while severity is None:
        text = input(a).lower()
        if any(word in text for word in ['light', "small"]):
            severity = count
        elif any(word in text for word in ['average', "normal"]):
            severity = count * 2
        elif any(word in text for word in ['heavy', "death", "dangerous"]):
            severity = count * 3
        else:
            b('I do not understand you, please repeat')
    Wet = str2bool(input(a))
    BadWinter = str2bool(input(a))
    BadSign = str2bool(input(a))
    BadVision = str2bool(input(a))

    if not orig:
        df = pd.DataFrame([[
            lat, lon, region, date, g.location, count, inj, dead, severity,
            Wet, BadWinter, BadSign, BadVision
        ]],
                          columns=[
                              'lat', 'lon', 'region', 'date', 'address',
                              'count', 'inj', 'death', 'severity', 'Wet',
                              'BadWinter', 'BadSign', 'BadVision'
                          ])
    else:
        df = pd.DataFrame(
            [[count, inj, dead, severity, Wet, BadWinter, BadSign, BadVision]],
            columns=[
                'count', 'inj', 'death', 'severity', 'Wet', 'BadWinter',
                'BadSign', 'BadVision'
            ])
    return df