## Exploration of the dataset "Monthly Salary of Public Worker in Brazil"
https://www.kaggle.com/gustavomodelli/monthly-salary-of-public-worker-in-brazil/version/1

In [6]:
import pandas as pd

The CSV file is distributed compressed as monthly-salary-of-public-worker-in-brazil.zip.
After unzipping it, reading the file with pd.read_csv() fails with parsing errors, because some lines contain more fields than the header declares.
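
The unzip step can also be done from Python. The following is a minimal sketch using the standard `zipfile` module; the helper name `extract_csv` is hypothetical, and it assumes the archive contains the CSV as a regular member whose name ends in `.csv`.

```python
import zipfile
from pathlib import Path


def extract_csv(zip_path, dest_dir="."):
    """Extract every .csv member of the archive and return the extracted paths."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip anything that is not a CSV file (readme, licence, ...).
            if name.lower().endswith(".csv"):
                extracted.append(Path(archive.extract(name, path=dest_dir)))
    return extracted
```

For this dataset the call would be `extract_csv('monthly-salary-of-public-worker-in-brazil.zip')`, which should leave monthly_salary_brazil.csv in the current directory if the archive is laid out as assumed.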

```python
data = pd.read_csv('monthly_salary_brazil.csv', index_col=0)
```

```
Id,job,sector,Month_salary,13_salary,eventual_salary,indemnity,extra_salary,discount_salary,total_salary
1,OFICIAL ADMINISTRATIVO,DETRAN,2315.810,0.000,0.000,0.000,73.850,0.000,1929.340
2,SD 2C PM,PM,3034.050,0.000,0.000,0.000,651.820,0.000,2265.960
3,1TEN  PM,PM,8990.980,0.000,0.000,0.000,626.750,0.000,6933.040
4,MAJ   PM,SPPREV,13591.020,0.000,0.000,0.000,0.000,0.000,10568.360
...
845,TEC MANUT., PROJETOS E OBRAS,CPTM,3861.030,9538.260,41.810,0.000,0.000,0.000,11782.060
...
```

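In the record with Id 845, the job field contains an unexpected comma ("TEC MANUT., PROJETOS E OBRAS"), so splitting the line on commas yields 11 fields instead of the 10 declared in the header. The same problem occurs in many other lines.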
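
To see how widespread the problem is before fixing it, the offending lines can be listed first. This is a rough sketch, not part of the original notebook: the function name `find_inconsistent_lines` is hypothetical, and it assumes that a plain split on commas is a fair way to count fields (the file has no quoting).

```python
def find_inconsistent_lines(lines, expected_fields=10):
    """Return (line_number, field_count) for lines whose comma split is off."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        count = len(line.rstrip("\r\n").split(","))
        if count != expected_fields:
            bad.append((line_number, count))
    return bad
```

Line numbers here are 1-based file lines, so the header is line 1 and the record with Id 845 would be reported one line later than its Id. With the real file, the function can be called as `find_inconsistent_lines(open('monthly_salary_brazil.csv'))`.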

We therefore need to clean the data before reading it with pandas.

In [33]:
import csv

EXPECTED_FIELDS = 10  # number of columns declared in the header
inconsistent_lines_count = 0
cleaned_data_list = []

with open('monthly_salary_brazil.csv', mode='r') as csv_file:
    for line in csv_file:
        values = line.rstrip('\r\n').split(',')
        surplus = len(values) - EXPECTED_FIELDS

        if surplus > 0:
            inconsistent_lines_count += 1

            # The stray commas sit inside the job field (2nd column), so merge
            # the surplus pieces back into it, dropping the commas.
            values[1:2 + surplus] = [''.join(values[1:2 + surplus])]

        cleaned_data_list.append(values)

print('Number of inconsistent lines found: ', inconsistent_lines_count)

print('Saving cleaned file: monthly_salary_brazil_cleaned.csv')

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('monthly_salary_brazil_cleaned.csv', mode='w', newline='') as cleaned_csv_file:
    csv_writer = csv.writer(cleaned_csv_file, delimiter=',')
    csv_writer.writerows(cleaned_data_list)

Number of inconsistent lines found:  919
Saving cleaned file: monthly_salary_brazil_cleaned.csv
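
An alternative to merging fields by position is to split each line from the right, which keeps the comma inside the job name instead of dropping it. The sketch below is an illustration under stated assumptions, not the method used above: the helper `split_record` is hypothetical, and it assumes the layout Id, job, sector followed by 7 numeric columns, with stray commas appearing only in the job field.

```python
def split_record(line, numeric_fields=7):
    """Split one raw line into 10 fields, keeping commas inside the job field.

    Assumes the layout Id, job, sector, then `numeric_fields` numeric columns,
    and that only the job field may contain stray commas.
    """
    # Peel off sector plus the numeric columns, counting commas from the right.
    head, *tail = line.rstrip("\r\n").rsplit(",", numeric_fields + 1)
    # What remains on the left is "Id,job"; only the first comma separates them.
    record_id, job = head.split(",", 1)
    return [record_id, job, *tail]
```

Rows produced this way can be passed to `csv.writer` unchanged: with its default quoting, the writer wraps any field containing a comma in double quotes, so pandas reads the job name back as a single value.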


Finally we can read the cleaned data with pandas:

In [34]:
data = pd.read_csv('monthly_salary_brazil_cleaned.csv', index_col=0)

print(data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1085678 entries, 1 to 1085678
Data columns (total 9 columns):
job                1085650 non-null object
sector             1085678 non-null object
Month_salary       1085678 non-null float64
13_salary          1085678 non-null float64
eventual_salary    1085678 non-null float64
indemnity          1085678 non-null float64
extra_salary       1085678 non-null float64
discount_salary    1085678 non-null float64
total_salary       1085678 non-null float64
dtypes: float64(7), object(2)
memory usage: 82.8+ MB
None
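
The summary reports 1085650 non-null job values out of 1085678 rows, so 28 rows have no job. A quick way to cross-check that count without pandas is sketched below. The helper `count_missing` and the `NA_MARKERS` set are hypothetical: the set is only a subset of the strings that `pd.read_csv` treats as missing by default, and it is an assumption that the missing job values in this file are among them.

```python
import csv

# Subset of the strings pandas.read_csv treats as missing by default
# (assumption: the missing job values in the file are one of these).
NA_MARKERS = {"", "NA", "N/A", "NULL", "NaN", "nan", "null"}


def count_missing(csv_file, column="job"):
    """Count rows whose `column` value is empty or a known NA marker."""
    reader = csv.DictReader(csv_file)
    # row[column] is None when a row is shorter than the header.
    return sum(1 for row in reader if (row[column] or "") in NA_MARKERS)
```

With the cleaned file this would be called as `count_missing(open('monthly_salary_brazil_cleaned.csv', newline=''))`; if the result differs from 28, the missing values are probably encoded with a marker that is not in the set.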
