Your company needs to launch a product targeted for:
* Aries (Group A)
* Taurus (Group B)
* Gemini (Group C)
* Cancer (Group D)
* Leo (Group E)
* Virgo (Group F)
* Libra (Group G)
* Scorpio (Group H)
* Sagittarius (Group I)
* Capricorn (Group J)
* Aquarius (Group Z)
* Pisces (No Group)

You are aware that you will need to market it with different strategies for the potential customers who believe in the horoscope and for those who do not. After some hours of searching, you found a dataset that suits your needs. With it, you should:
* Propose a product for the believers in the horoscope and another product for the non-believers
* Design a campaign to market these two products

Please submit the presentation and the report for the data visualization analysis. Note that the submission should be a zip file containing the report (PDF) and the data analysis (Jupyter notebook) files.

The report should consist of 2500-3000 words (no more than 10 pages).

# Load the data

We start by loading the libraries

In [2]:
import pandas as pd

We load the data

In [3]:
df = pd.read_csv('PartnerData/profiles.csv')

In [5]:
df.sign.unique()

array(['gemini', 'cancer', 'pisces but it doesn&rsquo;t matter', 'pisces',
       'aquarius', 'taurus', 'virgo', 'sagittarius',
       'gemini but it doesn&rsquo;t matter',
       'cancer but it doesn&rsquo;t matter',
       'leo but it doesn&rsquo;t matter', nan,
       'aquarius but it doesn&rsquo;t matter',
       'aries and it&rsquo;s fun to think about',
       'libra but it doesn&rsquo;t matter',
       'pisces and it&rsquo;s fun to think about', 'libra',
       'taurus but it doesn&rsquo;t matter',
       'sagittarius but it doesn&rsquo;t matter',
       'scorpio and it matters a lot',
       'gemini and it&rsquo;s fun to think about',
       'leo and it&rsquo;s fun to think about',
       'cancer and it&rsquo;s fun to think about',
       'libra and it&rsquo;s fun to think about',
       'aquarius and it&rsquo;s fun to think about',
       'virgo but it doesn&rsquo;t matter',
       'scorpio and it&rsquo;s fun to think about',
       'capricorn but it doesn&rsquo;t matter', 'sc

In [6]:
df.shape

(59946, 31)

In [12]:
# We need to select the categories from our sign
Selector = (df.sign == 'pisces') | (df.sign == 'pisces but it doesn&rsquo;t matter') | (df.sign == 'pisces and it&rsquo;s fun to think about') | (df.sign == 'pisces and it matters a lot')
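# Alternatively, we can use the isin method
Selector = df.sign.isin(['pisces','pisces but it doesn&rsquo;t matter','pisces and it&rsquo;s fun to think about','pisces and it matters a lot'])
# .copy() gives an independent DataFrame, so later column assignments do not warn about modifying a slice
df = df[Selector].copy()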
df.shape

(3946, 31)
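
Because the brief distinguishes believers from non-believers, it can help to separate the sign itself from the attitude suffix instead of listing every combination by hand. The following is a minimal sketch on toy data, assuming the suffix wording shown in the output of `df.sign.unique()` above. The frame `toy` and the column names `zodiac` and `belief` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy data mimicking the raw values (the &rsquo; entity is part of the data itself)
toy = pd.DataFrame({"sign": [
    "pisces",
    "pisces but it doesn&rsquo;t matter",
    "leo and it&rsquo;s fun to think about",
    "scorpio and it matters a lot",
    None,
]})

# The first word is the zodiac sign; the remainder (if any) is the attitude
parts = toy["sign"].str.split(" ", n=1, expand=True)
toy["zodiac"] = parts[0]   # hypothetical column name
toy["belief"] = parts[1]   # hypothetical column name; missing when no attitude is given

# Select every Pisces row regardless of the attitude suffix
pisces = toy[toy["zodiac"] == "pisces"]
print(pisces)
```

With the two pieces separated, the `belief` column could later be mapped to a believer / non-believer flag, although which suffix counts as "believer" is a modelling decision left to the analyst.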

In [0]:
<FILL>

We inspect the first five rows of the dataset

In [13]:
df.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
16,33,fit,,socially,,working on masters program,"i just moved to the bay area from austin, tx (...","making music, programming, getting back into a...","i'm from louisiana, so cooking and eating are ...","lately, i keep getting asked ""are you with the...",...,"oakland, california",,straight,likes dogs and likes cats,,m,pisces but it doesn&rsquo;t matter,sometimes,"english (fluently), c++ (fluently), german (po...",single
19,33,athletic,mostly anything,socially,never,graduated from masters program,i relocated to san francisco half a year ago. ...,"i left my comfort zone far behind in europe, a...","listening, connecting emotionally, analyzing t...","cheerful, open, curious, direct, active, sport...",...,"san francisco, california",doesn&rsquo;t have kids,straight,likes dogs and likes cats,catholicism but not too serious about it,m,pisces and it&rsquo;s fun to think about,no,english (fluently),single
41,35,athletic,strictly anything,,,graduated from college/university,-i grew up in new york and san diego half and ...,,scrambling eggs<br />\ntaboo<br />\nobscure/us...,once at a party a self confessed psychic told ...,...,"oakland, california","doesn&rsquo;t have kids, but wants them",straight,,agnosticism,m,pisces,no,english,single


What is the format of the dataset?

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3946 entries, 2 to 59939
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          3946 non-null   int64  
 1   body_type    3620 non-null   object 
 2   diet         2430 non-null   object 
 3   drinks       3826 non-null   object 
 4   drugs        2989 non-null   object 
 5   education    3602 non-null   object 
 6   essay0       3614 non-null   object 
 7   essay1       3513 non-null   object 
 8   essay2       3394 non-null   object 
 9   essay3       3306 non-null   object 
 10  essay4       3323 non-null   object 
 11  essay5       3331 non-null   object 
 12  essay6       3155 non-null   object 
 13  essay7       3245 non-null   object 
 14  essay8       2810 non-null   object 
 15  essay9       3205 non-null   object 
 16  ethnicity    3626 non-null   object 
 17  height       3946 non-null   float64
 18  income       3946 non-null   int64  
 19  job  

We see that most of the variables are not properly coded as int, float or string; pandas stores them with the generic object dtype (see the Dtype column)

### The first step is to classify the variables in the proper format

We have three possibilities:
* Categorical (yellow, blue, green)
* Ordinal (first, second, third)
* Continuous (1.76 m)

#### Example: Categorical

In [15]:
df.body_type.unique()

array(['thin', 'fit', 'athletic', nan, 'average', 'curvy',
       'a little extra', 'skinny', 'overweight', 'full figured',
       'used up', 'rather not say', 'jacked'], dtype=object)

We define it as a categorical variable

In [19]:
df.body_type = df.body_type.astype('category')

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   age          59946 non-null  int64   
 1   body_type    54650 non-null  category
 2   diet         35551 non-null  object  
 3   drinks       56961 non-null  object  
 4   drugs        45866 non-null  object  
 5   education    53318 non-null  object  
 6   essay0       54458 non-null  object  
 7   essay1       52374 non-null  object  
 8   essay2       50308 non-null  object  
 9   essay3       48470 non-null  object  
 10  essay4       49409 non-null  object  
 11  essay5       49096 non-null  object  
 12  essay6       46175 non-null  object  
 13  essay7       47495 non-null  object  
 14  essay8       40721 non-null  object  
 15  essay9       47343 non-null  object  
 16  ethnicity    54266 non-null  object  
 17  height       59943 non-null  float64 
 18  income       59946 non-nul

We can now see in row 1 that the variable has the category dtype

#### Example: Ordinal

In [16]:
df.smokes.unique()

array(['no', 'sometimes', 'when drinking', 'trying to quit', 'yes', nan],
      dtype=object)

In [17]:
from pandas.api.types import CategoricalDtype

# We define it as a categorical variable
df.smokes = df.smokes.astype('category')

# We list the categories in the order that will be used later on
categories_ordered = ['no', 'sometimes', 'when drinking', 'trying to quit', 'yes']

# We convert it and save it
df.smokes = df.smokes.cat.reorder_categories(categories_ordered, ordered=True)
df.smokes

2               no
3               no
16       sometimes
19              no
41              no
           ...    
59898          yes
59925           no
59929           no
59934           no
59939           no
Name: smokes, Length: 3946, dtype: category
Categories (5, object): [no < sometimes < when drinking < trying to quit < yes]
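
The cell above imports `CategoricalDtype` but does not use it. As an alternative sketch, the same ordering can be declared in one step with that class. The series `smokes_toy` and the name `smokes_dtype` are hypothetical, and the category order is simply the one chosen above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy data; note that 'trying to quit' does not occur in it
smokes_toy = pd.Series(["no", "yes", None, "sometimes", "when drinking"])

# Declare the categories and their order in a single dtype object
smokes_dtype = CategoricalDtype(
    categories=["no", "sometimes", "when drinking", "trying to quit", "yes"],
    ordered=True,
)
smokes_toy = smokes_toy.astype(smokes_dtype)

# Ordered categories support comparisons and min/max
print(smokes_toy > "no")
print(smokes_toy.min(), smokes_toy.max())
```

One practical difference: the dtype can list a category that is absent from the data, whereas `reorder_categories` expects exactly the categories that already exist.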

Let's check that we have done it properly (see column 28)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3946 entries, 2 to 59939
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   age          3946 non-null   int64   
 1   body_type    3620 non-null   object  
 2   diet         2430 non-null   object  
 3   drinks       3826 non-null   object  
 4   drugs        2989 non-null   object  
 5   education    3602 non-null   object  
 6   essay0       3614 non-null   object  
 7   essay1       3513 non-null   object  
 8   essay2       3394 non-null   object  
 9   essay3       3306 non-null   object  
 10  essay4       3323 non-null   object  
 11  essay5       3331 non-null   object  
 12  essay6       3155 non-null   object  
 13  essay7       3245 non-null   object  
 14  essay8       2810 non-null   object  
 15  essay9       3205 non-null   object  
 16  ethnicity    3626 non-null   object  
 17  height       3946 non-null   float64 
 18  income       3946 non-null 

#### Example: Continuous

In [24]:
df.income.unique()

array([     -1,   20000,  150000,   40000,  100000,   30000,   70000,
         50000,   80000,   60000, 1000000,  250000,  500000])

It is already in the proper format!
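
One caveat: the value -1 in the output above is not a plausible income and looks like a placeholder for an unanswered question. That reading is an assumption, since nothing in the output confirms it. Under that assumption, a minimal sketch on toy data (the names `income_toy` and `income_clean` are hypothetical) would turn the placeholder into a proper missing value before computing any statistic:

```python
import numpy as np
import pandas as pd

income_toy = pd.Series([-1, 20000, 150000, -1, 40000])

# Assumption: -1 encodes "not reported"; replace it with NaN
income_clean = income_toy.replace(-1, np.nan)

print(income_clean.isna().mean())  # share of missing values
print(income_clean.mean())         # mean over the reported incomes only
```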

### Exceptions: Time
One variable is last_online; you need to convert it to a time structure (see the slides and DatetimeIndex)

In [0]:
<FILL>
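
As a hedged starting point for this exercise, the sketch below parses timestamp strings with `pd.to_datetime`. The layout year-month-day-hour-minute separated by dashes is an assumption about how last_online is stored, so check a few raw values first. The names `last_online_toy` and `last_online_parsed` are hypothetical.

```python
import pandas as pd

# Toy values; the dash-separated layout is an assumption about the real column
last_online_toy = pd.Series(["2012-06-28-20-30", "2012-06-29-21-41", None])

# Parse with an explicit format; missing entries become NaT
last_online_parsed = pd.to_datetime(last_online_toy, format="%Y-%m-%d-%H-%M")

# Once parsed, date parts are available through the .dt accessor
print(last_online_parsed.dt.hour)

# A DatetimeIndex can be built from the parsed values if an index is needed
print(pd.DatetimeIndex(last_online_parsed))
```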

### Exceptions: Text
We are not working with text, so we delete it

#### Let's do it with essay0

In [24]:
# Let's see the data
df.essay0[0:5]

0    about me:<br />\n<br />\ni would love to think...
1    i am a chef: this is what that means.<br />\n1...
2    i'm not ashamed of much, but writing public te...
3            i work in a library and go to school. . .
4    hey how's it going? currently vague on the pro...
Name: essay0, dtype: object

We are going to drop the text because we are not going to use it

In [25]:
df.drop(columns=['essay0'],inplace=True)

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 30 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   age          59946 non-null  int64   
 1   body_type    54650 non-null  category
 2   diet         35551 non-null  object  
 3   drinks       56961 non-null  object  
 4   drugs        45866 non-null  object  
 5   education    53318 non-null  object  
 6   essay1       52374 non-null  object  
 7   essay2       50308 non-null  object  
 8   essay3       48470 non-null  object  
 9   essay4       49409 non-null  object  
 10  essay5       49096 non-null  object  
 11  essay6       46175 non-null  object  
 12  essay7       47495 non-null  object  
 13  essay8       40721 non-null  object  
 14  essay9       47343 non-null  object  
 15  ethnicity    54266 non-null  object  
 16  height       59943 non-null  float64 
 17  income       59946 non-null  int64   
 18  job          51748 non-nul

#### [ACTION] You should apply it to the rest of the variables until there are no more object columns

In [0]:
<FILL>
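
One possible way to approach this step for the purely nominal variables is to loop over the remaining object columns. This is a sketch on toy data, not a complete solution: ordinal variables still need an explicit order, and text and time columns need the separate treatment described above. The frame `toy` is hypothetical.

```python
import pandas as pd

# Toy frame with object columns, as in the dataset
toy = pd.DataFrame({
    "age": [38, 23, 33],
    "diet": pd.Series(["anything", "vegetarian", None], dtype="object"),
    "drinks": pd.Series(["socially", "socially", "often"], dtype="object"),
})

# Columns still stored with the generic object dtype
object_columns = toy.select_dtypes(include="object").columns

# Convert each of them to a (nominal) categorical variable
for column in object_columns:
    toy[column] = toy[column].astype("category")

print(toy.dtypes)
```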

### The second step is to solve the missing values

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   age          59946 non-null  int64   
 1   body_type    54650 non-null  category
 2   diet         35551 non-null  object  
 3   drinks       56961 non-null  object  
 4   drugs        45866 non-null  object  
 5   education    53318 non-null  object  
 6   essay0       54458 non-null  object  
 7   essay1       52374 non-null  object  
 8   essay2       50308 non-null  object  
 9   essay3       48470 non-null  object  
 10  essay4       49409 non-null  object  
 11  essay5       49096 non-null  object  
 12  essay6       46175 non-null  object  
 13  essay7       47495 non-null  object  
 14  essay8       40721 non-null  object  
 15  essay9       47343 non-null  object  
 16  ethnicity    54266 non-null  object  
 17  height       59943 non-null  float64 
 18  income       59946 non-nul

#### Let's start with the first variable

Let's figure out what we can do with the missing values. Let's compute the percentage

In [27]:
print(str(round(sum(df.body_type.isna())/len(df.body_type)*100)) + '%')

9%
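
The same percentage can be computed for every column at once, which makes it easier to decide where imputation is reasonable. A minimal sketch on toy data (the frame `toy` and the name `missing_pct` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "body_type": ["thin", None, "fit", "average"],
    "diet": [None, None, "anything", None],
    "age": [38, 23, 33, 35],
})

# isna() marks missing cells; the column-wise mean of that mask is the missing share
missing_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```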


##### As less than 10% of the values are missing, we substitute them with one of:
* the average (if it is a continuous variable). You should use Series.mean()
* the median (if it is ordinal). You should use Series.median()
* the mode (if it is categorical). You should use Series.mode()
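
The three rules can be sketched on toy data as follows. Depending on the pandas version, `Series.median()` may not be supported directly on a categorical column, so this sketch takes the median of the integer category codes instead. That workaround is an assumption of the example, not a method prescribed by the course. The frame `toy` and its values are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

toy = pd.DataFrame({
    "height": [170.0, None, 180.0, 175.0],
    "body_type": ["fit", "fit", None, "thin"],
    "smokes": ["no", None, "yes", "sometimes"],
})
smokes_dtype = CategoricalDtype(["no", "sometimes", "yes"], ordered=True)
toy["smokes"] = toy["smokes"].astype(smokes_dtype)

# Continuous: fill with the mean
toy["height"] = toy["height"].fillna(toy["height"].mean())

# Categorical: fill with the mode (the most frequent value)
toy["body_type"] = toy["body_type"].fillna(toy["body_type"].mode()[0])

# Ordinal: median of the integer codes (missing values have code -1)
codes = toy["smokes"].cat.codes
median_code = int(codes[codes >= 0].median())  # int() takes the lower category on ties
toy["smokes"] = toy["smokes"].fillna(toy["smokes"].cat.categories[median_code])

print(toy)
```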

In [28]:
df.body_type.mode()[0]

'average'

In [29]:
df.body_type = df.body_type.fillna(df.body_type.mode()[0])

In [30]:
# Let's check it out
print(str(round(sum(df.body_type.isna())/len(df.body_type)*100)) + '%')

0%


In [31]:
df.body_type.unique()

[a little extra, average, thin, athletic, fit, ..., full figured, jacked, rather not say, used up, overweight]
Length: 12
Categories (12, object): [a little extra, average, thin, athletic, ..., jacked, rather not say, used up, overweight]

#### [ACTION] You should apply it to the rest of the variables until there are no more missing values

## [Analytics] The next step is to understand the main characteristics of males and females for each variable