# Pandas - Cleaning data - casting datatypes and handling missing values

## Table of contents

* [Detecting missing values: `DataFrame.isnull()` or `DataFrame.isna()` methods](#Detecting-missing-values:-DataFrame.isnull()-or-DataFrame.isna()-methods)
* [Filling missing values: `DataFrame.fillna()` method](#Filling-missing-values:-DataFrame.fillna()-method)
* [Dropping missing values: `DataFrame.dropna()` method](#Dropping-missing-values:-DataFrame.dropna()-method)
* [Handling custom missing values](#Handling-custom-missing-values)
    * [Handling custom missing values at the `DataFrame` level: `DataFrame.replace()` method](#Handling-custom-missing-values-at-the-DataFrame-level:-DataFrame.replace()-method)
    * [Handling custom missing values at the `.csv` import level: `pandas.read_csv()` method's `na_values=` argument](#Handling-custom-missing-values-at-the-.csv-import-level:-pandas.read_csv()-method's-na_values=-argument)
* [Data types: `DataFrame.dtypes` attribute and Casting data types: `DataFrame.astype({dict})` or `Series.astype()` methods](#Data-types:-DataFrame.dtypes-attribute)
* [Now, let's go back to the stack overflow dataset and work with it](#Now,-let's-go-back-to-the-stack-overflow-dataset-and-work-with-it)
    

***

In [386]:
import pandas as pd
import numpy as np

In [387]:
people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'], 
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'], 
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com', None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
}

Notice that we have added a few missing values to our dictionary above:
- `np.nan`: Not A Number (NAN) value from the `numpy` library
- `None` values
- Some custom missing values with the strings 'NA' and 'Missing' 

In [388]:
df_ppl = pd.DataFrame(people)

In [389]:
df_ppl

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


## Detecting missing values: `DataFrame.isnull()` or `DataFrame.isna()` methods

Before we start handling the missing values, it is best to first 'detect' missing values in a `DataFrame`. That is what `DataFrame.isnull()` or `DataFrame.isna()` methods help us with.

We have looked at `DataFrame.isnull()` in the *'pandas_filtering'* notebook earlier.

`DataFrame.isna()` is identical to `DataFrame.isnull()`. These two methods do exactly the same thing. Even their docs are identical.

**But why have two methods with different names do the same thing?**

> This is because `pandas`' DataFrames are based on R's DataFrames. In R `na` and `null` are two separate things.

> However, in python, `pandas` is built on top of `numpy`, which has neither `na` nor `null` values. Instead `numpy` has `NaN` values (which stands for "Not a Number"). Consequently, `pandas` also uses `NaN` values.

Both the `DataFrame.isnull()` and `DataFrame.isna()` methods return a boolean object of the same shape as the original, indicating which values are missing.

In [390]:
df_ppl.isnull()

Unnamed: 0,first,last,email,age
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,True,False
4,True,True,True,True
5,True,True,False,True
6,False,False,False,False


In [391]:
df_ppl.isna()

Unnamed: 0,first,last,email,age
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,True,False
4,True,True,True,True
5,True,True,False,True
6,False,False,False,False
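
Because the result is a boolean `DataFrame`, it can also be aggregated. The following is a minimal sketch, using a small hypothetical `df_demo` rather than the notebook's data, that counts the missing values per column with `.sum()` and checks that the two methods give the same mask.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: np.nan and None are both treated as missing,
# while the string 'NA' is an ordinary value.
df_demo = pd.DataFrame({
    'first': ['Corey', np.nan, 'NA'],
    'email': [None, np.nan, 'NA'],
})

# True counts as 1 when summed, so this gives the missing values per column.
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# isnull() is an alias of isna(), so both masks are identical.
same_mask = df_demo.isna().equals(df_demo.isnull())
print(same_mask)
```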


## Filling missing values: `DataFrame.fillna()` method

The `fillna()` method is used to fill `NaN` values with values of our choice (or values using the specified method).

Let's say that we have a `DataFrame` with student marks for some assignments. If a student does not submit an assignment, it is initially marked as `NaN`. However, while calculating grades, we want to replace these `NaN` values with 0 marks. That is where `fillna()` can be used:

In [392]:
marks = {
    'name': ['Carl', 'Dave', 'Eva', 'Khalid', 'Ana', 'Ole', 'Kristin'], 
    'assignment_1': [8, 7, 5, 10, 6, 10, 8], 
    'assignment_2': [5, 7, np.nan, 8, 8, np.nan, np.nan],
    'assignment_3': [np.nan, 4, 3, 7, 4, 7, 6],
    'assignment_4': [np.nan, 3, 7, 7, np.nan, 8, 7],
    'assignment_5': [9, 6, np.nan, 9, 5, 9, np.nan],
}

In [393]:
df_marks = pd.DataFrame(marks)

In [394]:
df_marks

Unnamed: 0,name,assignment_1,assignment_2,assignment_3,assignment_4,assignment_5
0,Carl,8,5.0,,,9.0
1,Dave,7,7.0,4.0,3.0,6.0
2,Eva,5,,3.0,7.0,
3,Khalid,10,8.0,7.0,7.0,9.0
4,Ana,6,8.0,4.0,,5.0
5,Ole,10,,7.0,8.0,9.0
6,Kristin,8,,6.0,7.0,


In [395]:
df_marks.fillna(0, inplace=True)

In [396]:
df_marks

Unnamed: 0,name,assignment_1,assignment_2,assignment_3,assignment_4,assignment_5
0,Carl,8,5.0,0.0,0.0,9.0
1,Dave,7,7.0,4.0,3.0,6.0
2,Eva,5,0.0,3.0,7.0,0.0
3,Khalid,10,8.0,7.0,7.0,9.0
4,Ana,6,8.0,4.0,0.0,5.0
5,Ole,10,0.0,7.0,8.0,9.0
6,Kristin,8,0.0,6.0,7.0,0.0


This would then enable us to calculate a correct *'mean'* for each student.

In [397]:
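# numeric_only=True skips the non-numeric 'name' column, which would otherwise raise a TypeError in pandas 2.0 and later
df_marks['mean'] = df_marks.mean(axis='columns', numeric_only=True)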

In [398]:
df_marks

Unnamed: 0,name,assignment_1,assignment_2,assignment_3,assignment_4,assignment_5,mean
0,Carl,8,5.0,0.0,0.0,9.0,4.4
1,Dave,7,7.0,4.0,3.0,6.0,5.4
2,Eva,5,0.0,3.0,7.0,0.0,3.0
3,Khalid,10,8.0,7.0,7.0,9.0,8.2
4,Ana,6,8.0,4.0,0.0,5.0,4.6
5,Ole,10,0.0,7.0,8.0,9.0,6.8
6,Kristin,8,0.0,6.0,7.0,0.0,4.2
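
The effect of `fillna(0)` on the average can be seen on a single row. The sketch below uses a hypothetical one-student frame, `df_one`, holding Carl's marks from the table above. By default `.mean()` skips `NaN` values, so the unfilled mean is taken over three assignments rather than five.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame holding Carl's marks from the table above.
df_one = pd.DataFrame({
    'assignment_1': [8],
    'assignment_2': [5],
    'assignment_3': [np.nan],
    'assignment_4': [np.nan],
    'assignment_5': [9],
})

# NaN values are skipped by default: (8 + 5 + 9) / 3
mean_skipping_nan = df_one.mean(axis='columns').iloc[0]

# After filling, missing assignments count as 0: (8 + 5 + 0 + 0 + 9) / 5
mean_with_zeros = df_one.fillna(0).mean(axis='columns').iloc[0]

print(mean_skipping_nan, mean_with_zeros)
```

Which of the two numbers is "correct" depends on the grading policy; here the assumption is that an unsubmitted assignment should score 0.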


## Dropping missing values: `DataFrame.dropna()` method

In [399]:
df_ppl

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


In [400]:
df_ppl.dropna()

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
6,NA,Missing,NA,Missing


Note that `.dropna()` drops all the **ROWS** with **ANY** `NaN` or `None` values, but it does not drop the rows with *custom missing values* (i.e. the strings 'NA' and 'Missing').

Now, the `.dropna()` method above runs with some *implicit* default arguments. 

Let's re-run the `.dropna()` method with the same default arguments, but *explicit*:

In [401]:
df_ppl.dropna(axis='index', how='any')

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
6,NA,Missing,NA,Missing


The `axis='index', how='any'` arguments effectively mean: 'Drop **ROWS** with **ANY** missing values', respectively.

If we set these arguments to `axis='index', how='all'`, we would 'Drop **ROWS** with **ALL** missing values'.

In [402]:
df_ppl.dropna(axis='index', how='all')

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


Now, let's say that the only value that is mandatory for us, and hence the sole criterion for keeping or dropping a row, is the *'email'*. Then we could use the `subset=` argument:

In [403]:
df_ppl

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


In [404]:
df_ppl.dropna(axis='index', how='any', subset=['email'])

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


Now, since we have a `subset=` argument with only 1 column name passed into it, the `how=` argument becomes irrelevant.

The `how=` argument would be relevant only when:
- Either there is NO `subset=` argument defined for the `.dropna()` method, i.e. the `.dropna()` applies to all the columns of the `DataFrame`, 
- Or the `subset=` argument is defined with MULTIPLE column names passed into it.
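
The second case can be illustrated with a minimal sketch. It uses a hypothetical two-column frame, `df_sub`, modelled on the *'last'* and *'email'* columns above, and compares `how='any'` with `how='all'` for the same `subset=`.

```python
import numpy as np
import pandas as pd

# Hypothetical reduced version of df_ppl with only the two columns of interest.
df_sub = pd.DataFrame({
    'last': ['Schafer', 'Schafer', np.nan, np.nan],
    'email': ['CoreyMSchafer@gmail.com', None, np.nan, 'Anonymous@email.com'],
})

# how='any': drop a row if 'last' OR 'email' is missing.
kept_any = df_sub.dropna(axis='index', how='any', subset=['last', 'email'])

# how='all': drop a row only if 'last' AND 'email' are both missing.
kept_all = df_sub.dropna(axis='index', how='all', subset=['last', 'email'])

print(kept_any.index.tolist())
print(kept_all.index.tolist())
```

With `how='any'` only the fully populated row survives, whereas `how='all'` removes only the row in which both columns are missing.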

Now, if we want to drop all the **ROWS** with **ANY** of the *'last'* or *'email'* column values missing: 

In [405]:
df_ppl

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


In [406]:
df_ppl.dropna(axis='index', how='any', subset=['last', 'email'])

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
6,NA,Missing,NA,Missing


On all the `.dropna()` executions above, we can see that it returns a `DataFrame` with the missing values dropped, but does not make changes *inplace*.

To make changes *inplace* we can just set the argument `inplace=True`. But we are not going to do that here.
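
An alternative to `inplace=True` is to assign the returned `DataFrame` to a name. The following sketch, on a hypothetical `df_demo`, shows that the original object is left untouched by `.dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing email.
df_demo = pd.DataFrame({'email': ['a@email.com', np.nan, 'c@email.com']})

# dropna() returns a new DataFrame; df_demo itself still has 3 rows.
df_clean = df_demo.dropna(subset=['email'])

print(len(df_demo), len(df_clean))
```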

## Handling custom missing values

We could handle the custom missing values at:
- Either the `DataFrame` level,
- Or the `.csv` import level.

### Handling custom missing values at the `DataFrame` level: `DataFrame.replace()` method

We can use the `DataFrame.replace()` method to replace custom missing values with `np.nan`.

In [407]:
people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'], 
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'], 
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com', None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
}

In [408]:
df_a = pd.DataFrame(people)

In [409]:
df_a

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


In [410]:
df_a.replace('NA', np.nan, inplace=True)
df_a.replace('Missing', np.nan, inplace=True)
df_a

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,,,,


Now, since the *custom missing values* have been replaced with `np.nan`, we can re-run the `.dropna()` methods on the updated `DataFrame` above. We are not going to do that here.
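
The two `.replace()` calls can also be combined. As a minimal sketch on a hypothetical `df_demo`, passing a list of markers replaces all of them in a single call and returns a new `DataFrame`:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with the same custom markers as above.
df_demo = pd.DataFrame({
    'first': ['Corey', 'NA'],
    'age': ['33', 'Missing'],
})

# Passing a list replaces every listed marker in one call.
df_replaced = df_demo.replace(['NA', 'Missing'], np.nan)

print(df_replaced.isna())
```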

### Handling custom missing values at the `.csv` import level: `pandas.read_csv()` method's `na_values=` argument

This way of handling custom missing values is demonstrated in the last section, where we import the *'survey_results_public.csv'* file.
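
In the meantime, the idea can be sketched without a file on disk. The example below assumes a hypothetical in-memory CSV (`csv_text`) and reads it through `io.StringIO`; the `na_values=` argument is the same one used with the survey file later.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = (
    "first,age\n"
    "Corey,33\n"
    "Jane,Missing\n"
    "NA,55\n"
)

# Strings listed in na_values are parsed as NaN while the file is read.
df_csv = pd.read_csv(io.StringIO(csv_text), na_values=['NA', 'Missing'])

print(df_csv)
print(df_csv.dtypes)
```

Since 'Missing' is converted to `NaN` during parsing, the *'age'* column is read directly as a numeric column. Note that `read_csv()` already treats the string 'NA' as missing by default; listing it explicitly simply makes the intent visible.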

## Data types: `DataFrame.dtypes` attribute
## and
## Casting data types: `DataFrame.astype({dict})` or `Series.astype()` methods

In [411]:
df_a

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,,,,


The `DataFrame.dtypes` attribute returns a `Series` with the *data type* of each column. The result's `index` is the original `DataFrame`'s columns.

In [432]:
df_a.dtypes

first     object
last      object
email     object
age       object
dtype: object

> **A brief explanation of the difference between Python `type()` and NumPy `dtype` objects:**<br><br>
Python defines only one type of a particular object (there is only one integer type, one floating-point type, etc.). This can be checked by using the `type()` function on the object.<br><br>
NumPy adds 24 fundamental scalar types of its own, and every array carries a `dtype` object (an *instance* of the `numpy.dtype` class) that describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. These type descriptors are mostly based on the types available in the C language, in which CPython is written, with several additional types compatible with Python's types. The `dtype` of each column of a `DataFrame`, or of a `Series`, can be checked with the `DataFrame.dtypes` or `Series.dtypes` attributes, respectively.<br><br>
For further reading on NumPy `dtype` objects: https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html#arrays-dtypes
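
The distinction can be made concrete with a short sketch. The names `arr` and `s_num` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Python's built-in type() reports the class of a single object.
print(type(3.5))                    # <class 'float'>

# A NumPy array has one dtype describing how every element is stored.
arr = np.array([1, 2, 3], dtype=np.int32)
print(arr.dtype)                    # int32
print(isinstance(arr.dtype, np.dtype))

# A Series exposes its dtype in the same way.
s_num = pd.Series([1.5, 2.5])
print(s_num.dtype)                  # float64
```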

Taking the *'mean'* of the *'age'* column from the `DataFrame` above, using `df_a['age'].mean()`, will give us a `TypeError` because the column currently has the `object` dtype and holds its ages as Python `str` values, for which a mean cannot be computed.

So we need to *'cast'* (convert) the *'age'* column to a *numeric* (`int` or `float`) data type. This is known as *'data type casting'*.

The question is: should we convert the *'age'* column to the `int` or the `float` data type?

The caveat here is that, if the column that needs to be cast to a *numeric* data type contains any `NaN` values, it has to be cast to the `float` data type. The reason for this is that, under the hood, `np.nan` is itself a `float` with no integer equivalent, so trying to cast a column containing `NaN` to `int` raises an error.

In [413]:
type(np.nan)

float

In [414]:
type(None)

NoneType
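
The caveat about `NaN` and `int` can be checked on a small, hypothetical `Series` (`s_age`). This is only a sketch: casting to `float` succeeds, while casting to `int` raises a `ValueError` for this float column.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value.
s_age = pd.Series([33.0, 55.0, np.nan])

# Casting to float keeps the NaN.
as_float = s_age.astype(float)

# Casting to int fails because NaN has no integer representation.
try:
    s_age.astype(int)
    int_cast_failed = False
except ValueError as exc:
    int_cast_failed = True
    print(f"astype(int) failed: {exc}")
```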

The `astype()` method is used to cast a pandas object to a specified `dtype`. The object returned has the same type as the caller.

We could use `DataFrame.astype({dict})` or `Series.astype()` methods.

In [415]:
df_a['age'] = df_a['age'].astype(float)

We could also call the `.astype()` method on the `DataFrame` object, as shown below (and commented out), with the same effect:

In [416]:
# df_a = df_a.astype({'age': float})

In [417]:
df_a.dtypes

first     object
last      object
email     object
age      float64
dtype: object

Now that the *'age'* column has been cast to `float`, we can run the `.mean()` method on it:

In [418]:
df_a['age'].mean()

46.75

***

## Now, let's go back to the stack overflow dataset and work with it

In [419]:
df = pd.read_csv('work_directory/pandas/data/survey_results_public.csv', index_col='Respondent', na_values=['NA', 'Missing'])
df_schema = pd.read_csv('work_directory/pandas/data/survey_results_schema.csv', index_col='Column')

In [420]:
pd.set_option('display.max_rows', 85)
pd.set_option('display.max_columns', 85)

In [421]:
df.head()

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",,,4.0,10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Django;Flask,Flask;jQuery,Node.js,Node.js,IntelliJ;Notepad++;PyCharm,Windows,I do not use containers,,,Yes,"Fortunately, someone else has that title",Yes,Twitter,Online,Username,2017,A few times per month or weekly,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,31-60 minutes,No,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are",Neutral,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,,"Developer, desktop or enterprise applications;...",,17,,,,,,,I am actively looking for a job,I've never had a job,,,Financial performance or funding status of the...,"Something else changed (education, award, medi...",,,,,,,,,,,,,,,,,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Django,Django,,,Atom;PyCharm,Windows,I do not use containers,,Useful across many domains and could change ma...,Yes,Yes,Yes,Instagram,Online,Username,2017,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Taught yourself a new language, framework, or ...",100 to 499 employees,"Designer;Developer, back-end;Developer, front-...",3.0,22,1,Slightly satisfied,Slightly satisfied,Not at all confident,Not sure,Not sure,"I’m not actively looking, but I am open to new...",1-2 years ago,Interview with people in peer roles,No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,THB,Thai baht,23000.0,Monthly,8820.0,40.0,There's no schedule or spec; I work on what se...,Distracting work environment;Inadequate access...,Less than once per month / Never,Home,Average,No,,"No, but I think we should",Not sure,I have little or no influence,HTML/CSS,Elixir;HTML/CSS,PostgreSQL,PostgreSQL,,,,Other(s):,,,Vim;Visual Studio Code,Linux-based,I do not use containers,,,Yes,Yes,Yes,Reddit,In real life (in person),Username,2011,A few times per week,Find answers to specific questions;Learn how t...,6-10 times per week,They were about the same,,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,100 to 499 employees,"Developer, full-stack",3.0,16,Less than 1 year,Very satisfied,Slightly satisfied,Very confident,No,Not sure,I am not interested in new job opportunities,Less than a year ago,"Write code by hand (e.g., on a whiteboard);Int...",No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,61000.0,Yearly,61000.0,80.0,There's no schedule or spec; I work on what se...,,Less than once per month / Never,Home,A little below average,No,,"No, but I think we should",Developers typically have the most influence o...,I have little or no influence,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,,,.NET,.NET,Eclipse;Vim;Visual Studio;Visual Studio Code,Windows,I do not use containers,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Daily or almost daily,Find answers to specific questions;Pass the ti...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,"10,000 or more employees","Academic researcher;Developer, desktop or ente...",16.0,14,9,Very dissatisfied,Slightly dissatisfied,Somewhat confident,Yes,No,I am not interested in new job opportunities,Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,"Industry that I'd be working in;Languages, fra...",I was preparing for a job search,UAH,Ukrainian hryvnia,,,,55.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Inadequ...,A few days each month,Office,A little above average,"Yes, because I see value in code review",,"Yes, it's part of our process",Not sure,I have little or no influence,C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA,HTML/CSS;Java;JavaScript;SQL;WebAssembly,Couchbase;MongoDB;MySQL;Oracle;PostgreSQL;SQLite,Couchbase;Firebase;MongoDB;MySQL;Oracle;Postgr...,Android;Linux;MacOS;Slack;Windows,Android;Docker;Kubernetes;Linux;Slack,Django;Express;Flask;jQuery;React.js;Spring,Flask;jQuery;React.js;Spring,Cordova;Node.js,Apache Spark;Hadoop;Node.js;React Native,IntelliJ;Notepad++;Vim,Linux-based,"Outside of work, for personal projects",Not at all,,Yes,Also Yes,Yes,Facebook,In real life (in person),Username,I don't remember,Multiple times per day,Find answers to specific questions,More than 10 times per week,Stack Overflow was much faster,,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...","Yes, definitely",Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


Now, let's say we want to find out the **average years of coding experience for the respondents** in our `DataFrame`.

In [422]:
df['YearsCode'].head(10)

Respondent
1       4
2     NaN
3       3
4       3
5      16
6      13
7       6
8       8
9      12
10     12
Name: YearsCode, dtype: object

In [423]:
df['YearsCode'].dtypes

dtype('O')

In [424]:
type('O')

str

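`dtype('O')` is the NumPy `object` dtype, which `pandas` uses for columns that hold arbitrary Python objects, here `str` values. Note that `type('O')` above only reports the type of the literal string `'O'`, not of the column's values. (Read: 'A brief explanation of the difference between Python `type()` and NumPy `dtype` objects')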

Calling the `.mean()` method on the *'YearsCode'* column would raise a `TypeError` because the values in this column are *strings*.

Therefore, we need to *cast* the data type of this column to `float`.

In [425]:
# df['YearsCode'].astype(float)

The command commented out above will raise a `ValueError` because a string that cannot be parsed as a number, 'Less than 1 year', is present in the values of the column.

The *cast* to `float` will keep failing as long as one (or more) such string(s) remain in the column.

Let's have a look at all the unique values in the column before attempting to *cast* it to `float` again.

We could use the `.value_counts()` method for a tally of the unique values, or we could simply use the `.unique()` method if we just want to see all the unique values without their count.

In [426]:
df['YearsCode'].unique()

array(['4', nan, '3', '16', '13', '6', '8', '12', '2', '5', '17', '10',
       '14', '35', '7', 'Less than 1 year', '30', '9', '26', '40', '19',
       '15', '20', '28', '25', '1', '22', '11', '33', '50', '41', '18',
       '34', '24', '23', '42', '27', '21', '36', '32', '39', '38', '31',
       '37', 'More than 50 years', '29', '44', '45', '48', '46', '43',
       '47', '49'], dtype=object)
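
For a column with many distinct values, scanning this output by eye can be tedious. As an alternative that is not used in this notebook, `pd.to_numeric()` with `errors='coerce'` can isolate the entries that fail to parse. The sketch below runs on a hypothetical sample, `s_years`, that mimics the *'YearsCode'* values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the 'YearsCode' values.
s_years = pd.Series(
    ['4', np.nan, '3', 'Less than 1 year', 'More than 50 years'],
    dtype=object,
)

# errors='coerce' turns anything that cannot be parsed into NaN.
as_number = pd.to_numeric(s_years, errors='coerce')

# Entries that were present but failed to parse are the problem strings.
non_numeric = s_years[s_years.notna() & as_number.isna()]

print(non_numeric.unique())
```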

Let's just replace the unique `str` values with the most representative *numeric* values.

Note: we do not have to do anything about the `NaN` values in this column since the `.mean()` method would ignore these when calculating the *mean*.

In [427]:
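# Assign the result back; inplace=True on a selected column is unreliable in recent pandas versions
df['YearsCode'] = df['YearsCode'].replace('Less than 1 year', 0)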

In [428]:
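# Assign the result back; inplace=True on a selected column is unreliable in recent pandas versions
df['YearsCode'] = df['YearsCode'].replace('More than 50 years', 51)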

In [429]:
df['YearsCode'].unique()

array(['4', nan, '3', '16', '13', '6', '8', '12', '2', '5', '17', '10',
       '14', '35', '7', 0, '30', '9', '26', '40', '19', '15', '20', '28',
       '25', '1', '22', '11', '33', '50', '41', '18', '34', '24', '23',
       '42', '27', '21', '36', '32', '39', '38', '31', '37', 51, '29',
       '44', '45', '48', '46', '43', '47', '49'], dtype=object)

In [433]:
df['YearsCode'] = df['YearsCode'].astype(float)

In [434]:
df['YearsCode'].dtypes

dtype('float64')

In [431]:
df['YearsCode'].mean()

11.647779751332148

So the **average years of coding experience for the respondents is about 11.6 years**.
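
The whole workflow can be condensed into a few lines. The following is a sketch on a hypothetical sample (`s_years`), not on the survey data: both string values are mapped in one `.replace()` call using a dictionary, the result is cast to `float`, and `.mean()` skips the remaining `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the 'YearsCode' values.
s_years = pd.Series(
    ['4', np.nan, '3', 'Less than 1 year', 'More than 50 years'],
    dtype=object,
)

# Map each non-numeric string to a representative number, then cast to float.
replacements = {'Less than 1 year': 0, 'More than 50 years': 51}
s_clean = s_years.replace(replacements).astype(float)

# NaN is ignored by mean(): (4 + 3 + 0 + 51) / 4
mean_years = s_clean.mean()
print(mean_years)
```

The values 0 and 51 are the same stand-ins chosen above; a different choice would shift the mean accordingly.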