# <div style="text-align: center">A Data Science Framework for Quora </div>
### <div align="center"><b>Quite Practical and Far from any Theoretical Concepts</b></div>
<img src='http://s9.picofile.com/file/8342477368/kq.png'>
<div style="text-align:center">last update: <b>19/01/2019</b></div>

You can fork and run this kernel on **GitHub**:
> ###### [GitHub](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist)


 <a id="1"></a> <br>
## 1- Introduction
<font color="red">Quora</font> has launched a competition on **Kaggle** with a realistic and attractive data set for data scientists.
In this notebook, I will provide a **comprehensive** approach to solving the Quora classification problem for **beginners**.

I am open to getting your feedback for improving this **kernel**.

<a id="top"></a> <br>
## Notebook  Content
1. [Introduction](#1)
1. [Data Science Workflow for Quora](#2)
1. [Problem Definition](#3)
    1. [Business View](#32)
        1. [Real world Application Vs Competitions](#321)
    1. [What is an insincere question?](#33)
    1. [How can we find insincere questions?](#34)
1. [Problem Feature](#4)
    1. [Aim](#41)
    1. [Variables](#42)
    1. [Inputs & Outputs](#43)
1. [Select Framework](#5)
    1. [Import](#52)
    1. [Version](#53)
    1. [Setup](#54)
1. [Exploratory data analysis](#6)
    1. [Data Collection](#61)
        1. [Features](#611)
        1. [Explore the Dataset](#612)
    1. [Data Cleaning](#62)
    1. [Data Preprocessing](#63)
        1. [Is the data set imbalanced?](#631)
        1. [Some Feature Engineering](#632)
    1. [Data Visualization](#64)
        1. [countplot](#641)
        1. [pie plot](#642)
        1. [Histogram](#643)
        1. [violin plot](#644)
        1. [kdeplot](#645)
1. [Apply Learning](#7)
1. [Conclusion](#8)
1. [References](#9)

-------------------------------------------------------------------------------------------------------------

 **I hope you find this kernel helpful and some <font color="red"><b>UPVOTES</b></font> would be very much appreciated**
 
 -----------

<a id="2"></a> <br>
## 2- A Data Science Workflow for Quora
Of course, the same solution cannot be applied to every problem, so the best approach is to create a **general framework** and adapt it to each new problem.

**You can see my workflow in the image below**:

 <img src="http://s8.picofile.com/file/8342707700/workflow2.png"  />

**You should feel free to adjust this checklist to your needs**
###### [Go to top](#top)

<a id="3"></a> <br>
## 3- Problem Definition
I think one of the most important things when you start a new machine learning project is defining your problem. That means you should understand the business problem (**Problem Formalization**).
> **We will be predicting whether a question asked on Quora is sincere or not.**
<a id="31"></a> <br>
## 3-1 About Quora
Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.
<a id="32"></a> <br>
## 3-2 Business View 
An existential problem for any major website today is how to handle toxic and divisive content. **Quora** wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

In this kernel, I will develop models that identify and flag insincere questions. We help Quora uphold their policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.
<a id="321"></a> <br>
### 3-2-1 Real world Application Vs Competitions
Just a simple comparison between real-world applications and competitions:
<img src="http://s9.picofile.com/file/8339956300/reallife.png" height="600" width="500" />
<a id="33"></a> <br>
## 3-3 What is an insincere question?
An insincere question is defined as a question intended to make a **statement** rather than look for **helpful answers**.
<img src='http://s8.picofile.com/file/8342711526/Quora_moderation.png'>
<a id="34"></a> <br>
## 3-4 How can we find insincere questions?
Some characteristics that can signify that a question is insincere:

1. **Has a non-neutral tone**
    1. Has an exaggerated tone to underscore a point about a group of people
    1. Is rhetorical and meant to imply a statement about a group of people
1. **Is disparaging or inflammatory**
    1. Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
    1. Makes disparaging attacks/insults against a specific person or group of people
    1. Based on an outlandish premise about a group of people
    1. Disparages against a characteristic that is not fixable and not measurable
1. **Isn't grounded in reality**
    1. Based on false information, or contains absurd assumptions
    1. Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers
###### [Go to top](#top)

<a id="4"></a> <br>
## 4- Problem Feature
Describing the problem's features involves three steps:

1. Aim
1. Variables
1. Inputs & Outputs





<a id="41"></a> <br>
### 4-1 Aim
We will be predicting whether a question asked on Quora is **sincere** or not.


<a id="42"></a> <br>
### 4-2 Variables

1. qid - unique question identifier
1. question_text - Quora question text
1. target - a question labeled "insincere" has a value of 1, otherwise 0

<a id="43"></a> <br>
### 4-3 Inputs & Outputs
We use train.csv and test.csv as input, and we should upload a submission.csv as output.


**<< Note >>**
> You must answer the following question:
How does your company expect to use and benefit from **your model**?
###### [Go to top](#top)
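
To make the input and output contract a little more concrete, here is a minimal sketch of writing a submission file with the standard library only. The header names `qid` and `prediction` are an assumption based on how I remember the competition's sample submission, and the `write_submission` helper, the ids, and the labels are hypothetical, so please check the competition's data page before relying on them.

```python
import csv
import io


def write_submission(qids, predictions, handle):
    """Write one (qid, prediction) row per test question to an open text handle."""
    if len(qids) != len(predictions):
        raise ValueError("qids and predictions must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["qid", "prediction"])  # assumed header names
    for qid, label in zip(qids, predictions):
        writer.writerow([qid, int(label)])


# Hypothetical ids and labels, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_submission(["q001", "q002", "q003"], [0, 1, 0], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

For a real file, you could pass a handle opened with `open('submission.csv', 'w', newline='')` instead of the buffer.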

<a id="5"></a> <br>
## 5- Select Framework
After defining the problem and its features, we should select a framework to solve it.
By framework, we mean the programming language you use and the modules with which the problem will be solved.
###### [Go to top](#top)

<a id="52"></a> <br>
## 5-2 Import

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from wordcloud import WordCloud as wc
from nltk.corpus import stopwords
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt
from pandas import get_dummies
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import string
import scipy
import numpy
import nltk
import json
import sys
import csv
import os
  

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jagap\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

<a id="53"></a> <br>
## 5-3 Version

In [2]:
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))


matplotlib: 3.1.3
sklearn: 0.22.1
scipy: 1.4.1
seaborn: 0.10.0
pandas: 1.0.1
numpy: 1.18.1
Python: 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]


<a id="54"></a> <br>
## 5-4 Setup

A few small adjustments to plot styling and warnings for more **readable output**

In [3]:
sns.set(style='white', context='notebook', palette='deep')
pylab.rcParams['figure.figsize'] = 12,8
warnings.filterwarnings('ignore')
mpl.style.use('ggplot')
sns.set_style('white')
%matplotlib inline

<a id="55"></a> <br>
## 5-5 NLTK
In this kernel, we use the NLTK library, so before we begin the next step, we will first introduce it.
<img src='https://arts.unimelb.edu.au/__data/assets/image/0005/2735348/nltk.jpg' width=300 height=300>

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize
data = "All work and no play makes jack a dull boy, all work and no play"
print(word_tokenize(data))

['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', ',', 'all', 'work', 'and', 'no', 'play']


<a id="6"></a> <br>
## 6- EDA
In this section, we explore the data while generating graphics that are both insightful and beautiful. We will review the following analytical and statistical operations:

1. Data Collection
1. Visualization
1. Data Cleaning
1. Data Preprocessing
<img src="http://s9.picofile.com/file/8338476134/EDA.png" width=400 height=400>

 ###### [Go to top](#top)

<a id="61"></a> <br>
## 6-1 Data Collection
I start data collection by loading the training and testing datasets into **Pandas DataFrames**.
###### [Go to top](#top)

In [8]:
pwd

'C:\\Users\\jagap\\Desktop\\Python Tutorial\\Data Science\\5- Big Data'

In [11]:
train = pd.read_csv(r'C:\Users\jagap\Desktop\Python Tutorial\Data Science\input\input\train.csv')
test = pd.read_csv(r'C:\Users\jagap\Desktop\Python Tutorial\Data Science\input\input\test.csv')


In [15]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**<< Note 1 >>**

* Each **row** is an observation (also known as : sample, example, instance, record).
* Each **column** is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate).
###### [Go to top](#top)

In [16]:
train.sample(1) 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
846,847,0,3,"Sage, Mr. Douglas Bullen",male,,8,2,CA. 2343,69.55,,S


In [None]:
test.sample(1) 

Or you can use other commands to explore the dataset, such as:

In [None]:
train.tail(1)

<a id="611"></a> <br>
## 6-1-1 Features
Features can be of the following types:
* numeric
* categorical
* ordinal
* datetime
* coordinates

Can you find the types of the features in the **Quora dataset**?

To get some information about the dataset, you can use the **info()** command.

In [None]:
print(train.info())

In [17]:
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None
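
As a small sketch of how you might answer the feature-type question programmatically, `select_dtypes` can split the columns into numeric and non-numeric groups. The toy frame below is hypothetical; it only mimics the three Quora columns described earlier (`qid`, `question_text`, `target`) and is not the real data.

```python
import pandas as pd

# Hypothetical toy frame that mimics the three Quora columns.
toy = pd.DataFrame({
    "qid": ["a1", "b2", "c3"],
    "question_text": [
        "How do I learn Python?",
        "Why is the sky blue?",
        "Is 7 a prime number?",
    ],
    "target": [0, 0, 0],
})

# Split the columns by dtype family.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Here only `target` is numeric, while the identifier and the text are not, which is why most of the later work is about turning text into numbers.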


<a id="612"></a> <br>
## 6-1-2 Explore the Dataset

###### [Go to top](#top)

In [18]:
# shape for train and test
print('Shape of train:',train.shape)
print('Shape of test:',test.shape)

Shape of train: (891, 12)
Shape of test: (418, 11)


In [19]:
#columns*rows
train.size

10692

After loading the data via **pandas**, we should check its type, content, and statistical description with the following commands:

In [20]:
type(train)

pandas.core.frame.DataFrame

In [21]:
type(test)

pandas.core.frame.DataFrame

In [22]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


To display 5 random rows from the data set, we can use the **sample(5)** function and inspect the types of the features.

In [None]:
train.sample(5) 

<a id="62"></a> <br>
## 6-2 Data Cleaning

###### [Go to top](#top)

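How many NA elements are there in every column?

To check how many null values are in the dataset, we can use **isnull().sum()**. Zero in every column would be good news; in the output below, the Age, Cabin, and Embarked columns contain missing values.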

In [23]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

If the data has missing values, we can drop the affected rows with **dropna()** (be careful: sometimes you should not do this!).

In [24]:
# remove rows that have NA's
print('Before Droping',train.shape)
train = train.dropna()
print('After Droping',train.shape)

Before Droping (891, 12)
After Droping (183, 12)
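
The warning about `dropna()` is easier to see on a tiny example. The sketch below uses a hypothetical toy frame (the column names and values are made up for illustration) and compares dropping every incomplete row with two gentler options: dropping only on a column you really need, and filling a numeric gap with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "cabin": [np.nan, "C85", np.nan, "C123"],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

missing_per_column = toy.isnull().sum()

# Option 1: keep only rows with no missing value at all (can discard a lot).
dropped = toy.dropna()

# Option 2: drop a row only when a column we really need is missing.
only_age_required = toy.dropna(subset=["age"])

# Option 3: keep every row and fill the numeric gap with the median.
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())

print(missing_per_column)
print(len(toy), len(dropped), len(only_age_required), len(filled))
```

Which option is appropriate depends on the model and on why the values are missing, so treat this as a starting point rather than a rule.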



We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

To print the dataset's **columns**, we can use the columns attribute.

In [27]:
train.columns
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


You can inspect the target values with the command below; the commented-out np.unique call would list only the distinct values:

In [30]:
train_target = train['Survived'].values
train_target

#np.unique(train_target)

array([1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 1], dtype=int64)

Yes, the Quora problem is a **binary classification**! :)

To check the first 5 rows of the data set, we can use head(5).

In [31]:
train.head(5) 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


Or, to check the last 5 rows of the data set, we can use the tail() function.

In [None]:
train.tail() 

To get a **statistical summary** of the dataset, we can use **describe()**.


In [None]:
train.describe() 

As you can see, the statistical information that this command gives us is not well suited to this type of data.
**describe() is more useful for numerical data sets.**

**<< Note 2 >>**
In a pandas DataFrame, you can perform queries such as "where":

In [32]:
train.where(train['Survived']==1).count()

PassengerId    123
Survived       123
Pclass         123
Name           123
Sex            123
Age            123
SibSp          123
Parch          123
Ticket         123
Fare           123
Cabin          123
Embarked       123
dtype: int64
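
It may help to see how `where` differs from plain boolean indexing. In the sketch below (a hypothetical four-row frame, not the competition data), `where` keeps the original shape and blanks out the non-matching rows, which is why `count()` then reports the number of matches, while a boolean mask returns only the matching rows.

```python
import pandas as pd

# Hypothetical toy frame with a binary target column.
toy = pd.DataFrame({"qid": ["a", "b", "c", "d"], "target": [1, 0, 1, 1]})

# Same shape as toy; rows where the condition is False become missing values.
masked = toy.where(toy["target"] == 1)

# Only the rows where the condition is True are kept.
filtered = toy[toy["target"] == 1]

print(masked.count())  # non-null values per column = number of matching rows
print(len(filtered))
```

For simply selecting rows, the boolean mask is usually the more direct choice.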

As you can see below, it is easy to query a DataFrame in Python:

In [34]:
train[train['Survived']>1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


Some examples of questions that are insincere:

In [35]:
train[train['Survived']==1].head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S


<a id="631"></a> <br>
## 6-3-1 Is the data set imbalanced?


In [36]:
train_target.mean()

0.6721311475409836

The classes are not equally represented, so the data set is imbalanced, but **how can we solve it?**

In [38]:
train["Survived"].value_counts()
# data is imbalanced

1    123
0     60
Name: Survived, dtype: int64
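
One possible answer, among several (resampling is another), is to weight each class inversely to its frequency. The sketch below computes such weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The `balanced_class_weights` helper is hypothetical, and the label list is rebuilt from the counts printed above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Counts taken from the value_counts() output above: 123 ones and 60 zeros.
labels = [1] * 123 + [0] * 60
positive_share = sum(labels) / len(labels)  # same value as train_target.mean()
weights = balanced_class_weights(labels)
print(round(positive_share, 4), weights)
```

The rarer class gets the larger weight, so a model that accepts class weights pays more attention to its mistakes on that class.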

<a id="632"></a> <br>
## 6-3-2 Exploring Questions

In [39]:
question = train['question_text']
# print the first five questions, numbered from 1
for i, q in enumerate(question[:5], start=1):
    print('sample '+str(i)+':' ,q)

KeyError: 'question_text'

In [None]:
text_withnumber = train['question_text']
# strip digit characters from every question, one question at a time
result = text_withnumber.apply(lambda q: ''.join(ch for ch in str(q) if not ch.isdigit()))

<a id="632"></a> <br>
## 6-3-2 Some Feature Engineering

[NLTK](https://www.nltk.org/) (the Natural Language Toolkit) is one of the leading platforms for working with human language data in Python; the module is used for natural language processing.

We get a set of **English stop words** using the line:

In [40]:
#from nltk.corpus import stopwords
eng_stopwords = set(stopwords.words("english"))

The resulting set eng_stopwords contains **179 stop words** on my computer.
You can view the length or contents of this set with the lines:

In [41]:
print(len(eng_stopwords))
print(eng_stopwords)

179
{'or', "you'll", 'down', 'over', 'from', 'couldn', 'yours', 'her', "you'd", 'above', 'same', 'there', 'but', 'isn', 'ourselves', 'him', "she's", 'which', 'their', 'wasn', 'nor', "isn't", 'shouldn', 'that', 'during', 'have', 'o', 'these', "you're", 'his', 'too', 'few', 'theirs', 'being', 'mustn', 'only', "mightn't", 'myself', 'themselves', 'am', 'a', 'after', 't', 'been', 'below', 'up', 'll', 'those', 'having', 'wouldn', 'my', 'yourself', 'just', 'before', 'on', 'what', 'each', 'ain', 'we', "haven't", 'here', "should've", 'has', 'now', 'ma', 'to', 'doesn', 'be', "hadn't", "that'll", 'no', 'won', 'some', 'while', 'until', 'an', 'them', 'both', 'off', 'it', "shouldn't", 'through', 'aren', "wouldn't", 'out', "weren't", 'the', 'she', 'very', 'in', 'most', 'against', "mustn't", 'you', 'm', 'i', 'than', 'where', 'weren', 'between', 'again', 'hers', 'when', 'shan', 'because', 'own', 'he', 's', 'hadn', "needn't", 'y', 'himself', 'this', 'of', 'how', 'if', 'as', "hasn't", 'other', 'into', 'h
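
A typical use for such a set is filtering tokens before counting or modelling. The sketch below shows the idea with a small hand-written stop-word set, which is an assumption standing in for the NLTK list so that the example runs without any downloads; `remove_stopwords` is a hypothetical helper.

```python
# Small hand-written stop-word set (an assumption standing in for the NLTK list).
small_stopwords = {"a", "all", "and", "no"}


def remove_stopwords(tokens, stopwords):
    """Keep only tokens whose lower-cased form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


tokens = "All work and no play makes jack a dull boy".split()
content_words = remove_stopwords(tokens, small_stopwords)
print(content_words)
```

Lower-casing before the lookup matters because the stop words are stored in lower case, just as the feature code in this notebook does.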

The meta-features that we'll create, based on the EDA kernels by [sudalairajkumar](http://www.kaggle.com/sudalairajkumar/simple-feature-engg-notebook-spooky-author) (SRK) and [tunguz](https://www.kaggle.com/tunguz/just-some-simple-eda), are:
1. Number of words in the text
1. Number of unique words in the text
1. Number of characters in the text
1. Number of stopwords
1. Number of punctuations
1. Number of upper case words
1. Number of title case words
1. Average length of the words

###### [Go to top](#top)
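
Before applying these features column by column, it can help to see all eight computed for a single sentence. This is a minimal pure-Python sketch: the `text_meta_features` helper and the tiny stop-word set are hypothetical (the notebook itself uses the NLTK English list), and the empty-text guard for the mean is my own choice.

```python
import string

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's list.
SMALL_STOPWORDS = {"a", "an", "the", "is", "are", "do", "in", "of", "to", "why", "so"}


def text_meta_features(text, stopwords=SMALL_STOPWORDS):
    """Compute the eight meta-features listed above for one question."""
    text = str(text)
    words = text.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_stopwords": sum(1 for w in words if w.lower() in stopwords),
        "num_punctuations": sum(1 for c in text if c in string.punctuation),
        "num_words_upper": sum(1 for w in words if w.isupper()),
        "num_words_title": sum(1 for w in words if w.istitle()),
        # Guard against empty text, where the mean is undefined.
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


features = text_meta_features("Why is the sky SO blue in Paris?")
print(features)
```

Note that words are split on whitespace only, so "Paris?" keeps its question mark and counts as a six-character title-case word.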

Number of words in the text 

In [None]:
train["num_words"] = train["question_text"].apply(lambda x: len(str(x).split()))
test["num_words"] = test["question_text"].apply(lambda x: len(str(x).split()))
print('maximum of num_words in train',train["num_words"].max())
print('min of num_words in train',train["num_words"].min())
print("maximum of num_words in test",test["num_words"].max())
print('min of num_words in test',test["num_words"].min())


Number of unique words in the text

In [None]:
train["num_unique_words"] = train["question_text"].apply(lambda x: len(set(str(x).split())))
test["num_unique_words"] = test["question_text"].apply(lambda x: len(set(str(x).split())))
print('maximum of num_unique_words in train',train["num_unique_words"].max())
print('mean of num_unique_words in train',train["num_unique_words"].mean())
print("maximum of num_unique_words in test",test["num_unique_words"].max())
print('mean of num_unique_words in test',test["num_unique_words"].mean())

Number of characters in the text 

In [None]:

train["num_chars"] = train["question_text"].apply(lambda x: len(str(x)))
test["num_chars"] = test["question_text"].apply(lambda x: len(str(x)))
print('maximum of num_chars in train',train["num_chars"].max())
print("maximum of num_chars in test",test["num_chars"].max())

Number of stopwords in the text

In [None]:
train["num_stopwords"] = train["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
test["num_stopwords"] = test["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
print('maximum of num_stopwords in train',train["num_stopwords"].max())
print("maximum of num_stopwords in test",test["num_stopwords"].max())

Number of punctuations in the text

In [None]:

train["num_punctuations"] =train['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test["num_punctuations"] =test['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
print('maximum of num_punctuations in train',train["num_punctuations"].max())
print("maximum of num_punctuations in test",test["num_punctuations"].max())

Number of upper case words in the text

In [None]:

train["num_words_upper"] = train["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test["num_words_upper"] = test["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
print('maximum of num_words_upper in train',train["num_words_upper"].max())
print("maximum of num_words_upper in test",test["num_words_upper"].max())

Number of title case words in the text

In [None]:

train["num_words_title"] = train["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test["num_words_title"] = test["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
print('maximum of num_words_title in train',train["num_words_title"].max())
print("maximum of num_words_title in test",test["num_words_title"].max())

 Average length of the words in the text 

In [None]:

train["mean_word_len"] = train["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test["mean_word_len"] = test["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
print('maximum of mean_word_len in train',train["mean_word_len"].max())
print("maximum of mean_word_len in test",test["mean_word_len"].max())

We have now added some new features to the train and test data sets; let's print the columns again:

In [None]:
print(train.columns)
train.head(1)

**<< Note >>**
>**Preprocessing and generation pipelines depend on a model type**

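## What is a Tokenizer?
A tokenizer splits raw text into smaller units called tokens, such as words and punctuation marks.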

In [42]:
import nltk
mystring = "I love Kaggle"
mystring2 = "I'd love to participate in kaggle competitions."
nltk.word_tokenize(mystring)

['I', 'love', 'Kaggle']

In [43]:
nltk.word_tokenize(mystring2)

['I', "'d", 'love', 'to', 'participate', 'in', 'kaggle', 'competitions', '.']
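
To show roughly what a word tokenizer does, here is a simplified stand-in built on a regular expression. It is not NLTK's algorithm, and `simple_tokenize` is a hypothetical helper: it just separates runs of word characters from single punctuation marks.

```python
import re

# A word is a run of word characters; anything else that is not whitespace
# becomes its own single-character token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return TOKEN_PATTERN.findall(text)


first = simple_tokenize("I love Kaggle")
second = simple_tokenize("I'd love to participate in kaggle competitions.")
print(first)
print(second)
```

The first sentence comes out the same as with `nltk.word_tokenize`, but the contraction in the second is split into three pieces (`I`, `'`, `d`) instead of NLTK's `I` and `'d`, which is one reason to prefer a proper tokenizer for real work.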

<a id="64"></a> <br>
## 6-4 Data Visualization
**Data visualization**  is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

> * Two **important rules** for data visualization:
>     1. Do not put too little information
>     1. Do not put too much information

###### [Go to top](#top)

<a id="641"></a> <br>
## 6-4-1 CountPlot

In [None]:
ax=sns.countplot(x='target',hue="target", data=train  ,linewidth=5,edgecolor=sns.color_palette("dark", 3))
plt.title('Is data set imbalance?');

In [None]:
ax = sns.countplot(y="target", hue="target", data=train)
plt.title('Is data set imbalance?');

<a id="642"></a> <br>
## 6-4-2  Pie Plot

In [None]:

ax=train['target'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%' ,shadow=True)
ax.set_title('target')
ax.set_ylabel('')
plt.show()

In [None]:
#plt.pie(train['target'],autopct='%1.1f%%')
 
#plt.axis('equal')
#plt.show()

<a id="643"></a> <br>
## 6-4-3  Histogram

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,10))
train[train['target']==0].num_words.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('target= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
train[train['target']==1].num_words.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('target= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
train[['target','num_words']].groupby(['target']).mean().plot.bar(ax=ax[0])
ax[0].set_title('num_words vs target')
sns.countplot(x='num_words',hue='target',data=train,ax=ax[1])
ax[1].set_title('num_words:target=0 vs target=1')
plt.show()

In [None]:
# histograms
train.hist(figsize=(15,20))
plt.show()

In [None]:
train["num_words"].hist();

<a id="644"></a> <br>
## 6-4-4 Violin Plot

In [None]:
sns.violinplot(data=train,x="target", y="num_words")

In [None]:
sns.violinplot(data=train,x="target", y="num_words_upper")

<a id="645"></a> <br>
## 6-4-5 KdePlot

In [None]:
sns.FacetGrid(train, hue="target", height=5).map(sns.kdeplot, "num_words").add_legend()
plt.show()

<a id="646"></a> <br>
## 6-4-6 BoxPlot

In [None]:
train.loc[train['num_words'] > 60, 'num_words'] = 60  # truncation for better visuals
axes= sns.boxplot(x='target', y='num_words', data=train)
axes.set_xlabel('Target', fontsize=12)
axes.set_title("Number of words in each class", fontsize=15)
plt.show()

In [None]:
train.loc[train['num_chars'] > 350, 'num_chars'] = 350  # truncation for better visuals

axes= sns.boxplot(x='target', y='num_chars', data=train)
axes.set_xlabel('Target', fontsize=12)
axes.set_title("Number of num_chars in each class", fontsize=15)
plt.show()
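
The truncation used for the box plots caps extreme values so that a few very long questions do not squash the rest of the plot. As a small sketch with made-up word counts, `Series.clip` expresses the same idea and returns a new Series, which leaves the original values available for modelling.

```python
import pandas as pd

# Hypothetical word counts, including two outliers.
num_words = pd.Series([3, 12, 60, 75, 134], name="num_words")

# Values above 60 become 60; everything else is unchanged.
capped = num_words.clip(upper=60)
print(capped.tolist())
print(num_words.max())  # the original Series still holds the outliers
```

Plotting the capped copy rather than overwriting the column keeps the visual tweak from leaking into later feature analysis.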

<a id="646"></a> <br>
## 6-4-6 WordCloud

In [None]:
def generate_wordcloud(text): 
    wordcloud = wc(relative_scaling = 1.0,stopwords = eng_stopwords).generate(text)
    fig,ax = plt.subplots(1,1,figsize=(10,10))
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis("off")
    ax.margins(x=0, y=0)
    plt.show()

In [None]:
text =" ".join(train.question_text)
generate_wordcloud(text)

-----------------
<a id="8"></a> <br>
# 8- Conclusion

This kernel is not complete yet. I have tried to cover all the parts of the process for the **Quora problem** with a variety of Python packages. I know that there are still some problems, so I hope to get your feedback to improve it.



<a id="9"></a> <br>

-----------

# 9- References
## 9-1 Kaggle's Kernels
**Finally, I want to thank the authors of all the kernels I've used in this notebook**:
1. [SRK](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc)
1. [mihaskalic](https://www.kaggle.com/mihaskalic/lstm-is-all-you-need-well-maybe-embeddings-also)
1. [artgor](https://www.kaggle.com/artgor/eda-and-lstm-cnn)
1. [tunguz](https://www.kaggle.com/tunguz/just-some-simple-eda)

## 9-2 Other References

1. [Machine Learning Certification by Stanford University (Coursera)](https://www.coursera.org/learn/machine-learning/)

1. [Machine Learning A-Z™: Hands-On Python & R In Data Science (Udemy)](https://www.udemy.com/machinelearning/)

1. [Deep Learning Certification by Andrew Ng from deeplearning.ai (Coursera)](https://www.coursera.org/specializations/deep-learning)

1. Python for Data Science and Machine Learning Bootcamp (Udemy)

1. [Mathematics for Machine Learning by Imperial College London](https://www.coursera.org/specializations/mathematics-machine-learning)

1. [Deep Learning A-Z™: Hands-On Artificial Neural Networks](https://www.udemy.com/deeplearning/)

1. [Complete Guide to TensorFlow for Deep Learning Tutorial with Python](https://www.udemy.com/complete-guide-to-tensorflow-for-deep-learning-with-python/)

1. [Data Science and Machine Learning Tutorial with Python – Hands On](https://www.udemy.com/data-science-and-machine-learning-with-python-hands-on/)
1. [imbalanced-dataset](https://www.quora.com/What-is-an-imbalanced-dataset)
1. [algorithm-choice](https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice)
1. [tokenizing-raw-text-in-python](http://jeffreyfossett.com/2014/04/25/tokenizing-raw-text-in-python.html)
1. [text-analytics101](http://text-analytics101.rxnlp.com/2014/10/all-about-stop-words-for-text-mining.html)
-------------

###### [Go to top](#top)

Go to first step: [Course Home Page](https://www.kaggle.com/mjbahmani/10-steps-to-become-a-data-scientist)

Go to next step : [Titanic](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python)


#### The kernel is not complete and will be updated soon  !!!