![Food Claim Process in Vivendo](download.png)
# Data Analyst Associate Practical Exam Submission 
## Analysis of Food Claim Process in Vivendo
In this notebook we derive insights about the **Food Claim Process in Vivendo**, a fast food chain in Brazil with over 200 outlets.

As with many fast food establishments, customers make claims against the company. For example, they blame Vivendo for suspected food poisoning. The legal team, who processes these claims, is currently split across four locations. The new head of the legal department wants to see if there are differences in the time it takes to close claims across the locations.
### Tool used for analysis and visualizations: Tableau

To view the interactive dashboard, [click here](https://public.tableau.com/views/DatacampCertification/Dashboard1?:language=en-US&:display_count=n&:origin=viz_share_link).

## Table of Contents
1. [Data Dictionary](https://app.datacamp.com/workspace/w/fff33e41-de48-4924-a486-edf25e5d19b2#1-data-dictionary---source)
2. [Data Validation](https://app.datacamp.com/workspace/w/fff33e41-de48-4924-a486-edf25e5d19b2#2-data-validation)
3. [Data Discovery and Visualization](https://app.datacamp.com/workspace/w/fff33e41-de48-4924-a486-edf25e5d19b2#3-data-discovery-and-visualization)
4. [Summary](https://app.datacamp.com/workspace/w/fff33e41-de48-4924-a486-edf25e5d19b2#-summary)
5. [Suggestions](https://app.datacamp.com/workspace/w/fff33e41-de48-4924-a486-edf25e5d19b2#suggestions)

## 1. Data Dictionary - [source](https://s3.amazonaws.com/talent-assets.datacamp.com/claims.csv)

<p>Let's familiarize ourselves with the data: it has 8 columns and 98 rows.</p>

| Column Name | Datatype| Description |
|:---|:---|:---|
| Claim ID  | Character| the unique identifier of the claim |
| Time to Close | Numeric | Number of days it took for the claim to be closed |
| Claim Amount  | Numeric | Initial claim value in the currency of Brazil |
| Amount Paid | Numeric | Total amount paid after the claim closed in the currency of Brazil |
| Location | Character | Location of the claim, one of “RECIFE”, “SAO LUIS”, “FORTALEZA”, or “NATAL”. |
| Individuals on Claim  | Numeric | Number of individuals on this claim |
| Linked Cases  | Binary | If this claim is believed to be linked with other cases, either TRUE or FALSE |
| Cause  | Character | The cause of the food poisoning injuries, one of ‘vegetable’, ‘meat’, or ‘unknown’ |


## 2. Data Validation
In this section we check that the data matches the criteria in the data dictionary. The table below summarizes the validation tasks performed, for easy reference:


| Column Name | Findings (NULL values / incorrect data type / other) | Action Performed | Formula used for the new calculated field |
|:---|:---|:---|:---|
| Claim ID  | NIL| NIL |NIL |
| Time to Close | Incorrect value found (-57 for Claim ID 0000000-00.2019.0.00.0079) | Assumed the value was a recording error, so created a new calculated field "Time_to_close_modified" that changes negative values to positive |`IF [Time to Close]<0 THEN [Time to Close]*-1 ELSE [Time to Close] END` |
| Claim Amount  | Data type is incorrect | Created a new calculated field "Claim Amount - Modified" that removes "R\$" using SPLIT and TRIM and converts the result to a numeric data type |`INT(TRIM( SPLIT( [Claim Amount], "R$", 2 ) ))` |
| Amount Paid |NIL | NIL |NIL |
| Location |NIL | NIL |NIL |
| Individuals on Claim  | 7 zero values | Replaced the zero values with 1 (imputed from the value in the "Amount Paid" column) |`IF [Individuals on Claim]=0 AND [Amount Paid] <8364 THEN [Individuals on Claim]+1 ELSE [Individuals on Claim] END` |
| Linked Cases  |NIL | NIL |NIL |
| Cause  | 78 NULL values found | Replaced NULL values with 'unknown' in a new column "cause_modified" |`IFNULL([Cause],"unknown")` |

Of the 8 columns, 4 (Claim ID, Amount Paid, Location, Linked Cases) had no NULL values, errors, or incorrect data types. These columns were left unchanged because all values matched the data dictionary.

The changes made to the remaining 4 columns to enable further analysis are detailed below:

* **Column 2: Time to Close**:	
	- There were no NULL values, but one unexpected value was found: -57 for Claim ID 0000000-00.2019.0.00.0079.
	- Action performed: Since this column records a number of days, it cannot be negative, so -57 was assumed to be a recording error. It was converted to +57 by creating a new calculated field (Time_to_close_modified) with the formula:
 		  	<p>	 IF [Time to Close]<0 <br>
 	 	    THEN [Time to Close]*-1 <br>
					ELSE [Time to Close] <br>
					END</p>
      	      
* **Column 3: Claim Amount**:
	- There were no NULL values, but the data type of the column is incorrect. For example, "R\$50,000.00" should be converted into 50000. The value was split on "R\$" and the second token kept (which drops the currency prefix), then trimmed of extra spaces and converted to an integer.
    - Created a new calculated field (Claim Amount - Modified) with the following formula:
 	    <p> INT(TRIM( SPLIT( [Claim Amount], "R$", 2 ) )) </p>
     
* **Column 6: Individuals on Claim**:
	- Unexpected values found: 7 rows with a value of zero.
    - After checking the other columns, I inferred that these zeros are recording errors: the 7 rows have unique claim IDs, received a payment towards their claim, and are not linked with other cases.
    - Action performed: Replaced the zero values with 1, based on the values in the Amount Paid column.
    	(The lowest Amount Paid for a claim with 2 individuals is 8364, and Amount Paid for the 7 zero-valued rows ranges from 999 to 5395. So the number of individuals was taken to be 1 for these 7 rows.)
    - Created a new calculated field (Individuals_on_claim modified) using the formula:
    	   <p>IF [Individuals on Claim]=0 AND [Amount Paid] <8364 <br>
				THEN [Individuals on Claim]+1 <br>
					ELSE [Individuals on Claim] <br>
					END</p>
* **Column 8: Cause** :
	- NULL values found in 78 rows.
	- Action performed: Replaced NULL values with 'unknown' in a new column (cause_modified) using the formula:
  	  <p>IFNULL([Cause],"unknown") </p>
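
For readers who want to reproduce the four fixes above outside Tableau, here is a minimal Python sketch. It is not the method used for this submission (that was done with the Tableau calculated fields shown above); the function name `clean_row` and the dict-per-row layout are assumptions made for illustration, and the example record is made up apart from the values quoted in the text (-57, "R$50,000.00", 999, 8364). One deliberate difference: the Python version also removes the comma thousands separator, because Python's `float` does not accept it.

```python
def clean_row(row):
    """Apply the four validation fixes to one claim record (a dict)."""
    cleaned = dict(row)

    # Time to Close: a day count cannot be negative, so flip the sign if needed.
    cleaned["Time to Close"] = abs(row["Time to Close"])

    # Claim Amount: drop the "R$" prefix and the thousands separator,
    # then keep the whole-number part.
    amount_text = row["Claim Amount"].replace("R$", "").replace(",", "").strip()
    cleaned["Claim Amount"] = int(float(amount_text))

    # Individuals on Claim: treat 0 as 1 when the payout is below 8364,
    # the lowest Amount Paid seen on a two-person claim.
    if row["Individuals on Claim"] == 0 and row["Amount Paid"] < 8364:
        cleaned["Individuals on Claim"] = 1

    # Cause: missing values become "unknown".
    cleaned["Cause"] = row["Cause"] if row["Cause"] is not None else "unknown"
    return cleaned


# Illustrative record only; not a row copied from claims.csv.
example = {
    "Time to Close": -57,
    "Claim Amount": "R$50,000.00",
    "Amount Paid": 999,
    "Individuals on Claim": 0,
    "Cause": None,
}
cleaned_example = clean_row(example)
```

Each rule leaves values that are already valid untouched, which matches the `ELSE` branches of the Tableau formulas.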


## 3. Data Discovery and Visualization
### (i) How does the number of claims differ across locations?
* The legal team that processes these claims is currently split across four locations: “RECIFE”, “SAO LUIS”, “FORTALEZA”, and “NATAL”. First, I check the number of claims in each location.
* **Inference from graph below:** "Sao Luis" received the most claims and "Natal" the fewest. The next step is to look at how the data is distributed across locations and whether it relates to other fields.
![Number of claims Vs Locations](Count_vs_location.png)
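
The count behind this chart is a simple tally per location. A minimal Python sketch of that tally is shown here for reference; the helper name `claims_per_location` and the sample list are hypothetical and are not taken from the claims data.

```python
from collections import Counter


def claims_per_location(locations):
    """Return (location, claim count) pairs, most frequent first."""
    return Counter(locations).most_common()


# Made-up sample, only to show the shape of the input.
sample_locations = [
    "SAO LUIS", "RECIFE", "SAO LUIS", "NATAL",
    "FORTALEZA", "SAO LUIS", "RECIFE",
]
counts = claims_per_location(sample_locations)
```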

   
### (ii) What is the distribution of time to close claims?
* Next, I look at the distribution of time to close claims across all locations.
* **Inference from graph below:** The distribution of time to close claims is not uniform. The largest number of claims closed within 400-600 days, and very few (around 4) took longer than 2400 days.
![Distribution of Time to close claims overall location](distribution_across_location.png)

* To get more insight into the distribution of time to close claims across all locations, I used a box plot.
* **Inference from graph below:** 50 percent of all claims are closed within 638 days (the median) and 75 percent within 1143 days (the upper quartile). The next step is to check how each location contributes to this.
![Distribution of Time to close claims box plot](distribution_box_plot.png)
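
The two figures read off the box plot are the median and the upper quartile. As a rough cross-check outside Tableau, they could be computed with Python's standard library as sketched below. The function name and the sample values are hypothetical, and the `"inclusive"` method is an assumption: Tableau may use a different quartile convention, so results on the real data could differ slightly.

```python
from statistics import median, quantiles


def closing_time_summary(days):
    """Return the quartiles of a list of time-to-close values (in days)."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(days, n=4, method="inclusive")
    return {"q1": q1, "median": median(days), "q3": q3}


# Made-up values, not the claims data.
sample_days = [100, 200, 300, 400, 500]
summary = closing_time_summary(sample_days)
```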

* Next, I compare how the distribution of time to close claims differs by location.
* **Inference from graph below:** "Sao Luis" takes the most days to close both 50 percent and 75 percent of its claims compared to the other locations. The next step is to check which factors affect this.
![Distribution of Time to close claims for each location](each_location_distribution.png)

### (iii) How does the average time to close claims differ by location?
* Here I compare the average time to close claims by location.
* **Inference from graph below:** The average time to close claims is highest in Sao Luis. This might be because Sao Luis also has the highest number of claims, so it needs further investigation.
![Average time to close claims Vs location](avg_time_location.png)
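
The per-location average in this chart is a group-and-mean calculation. A minimal Python sketch follows, assuming each record is a `(location, days)` pair; that layout, the function name, and the sample rows are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_time_by_location(rows):
    """Group (location, days) pairs by location and average the days."""
    grouped = defaultdict(list)
    for location, days in rows:
        grouped[location].append(days)
    return {location: mean(values) for location, values in grouped.items()}


# Made-up rows, not the claims data.
sample_rows = [
    ("SAO LUIS", 900), ("SAO LUIS", 1100),
    ("NATAL", 500),
    ("RECIFE", 700), ("RECIFE", 800),
]
averages = average_time_by_location(sample_rows)
```
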
### (iv) Is there a relationship between Avg. time to close claims and other measures?
* Here I analyse why the Avg. time to close is highest in Sao Luis by checking its relationship with other measures.
* **Inference from graph and table below:** It seems that as the count of claims increases, the Avg. time to close also increases. However, we are unable to deduce a relationship between Avg. time to close and the other measures (Avg. Amount paid, Avg. Individuals on claim).
![TIME_TO_CLOSE_VS_COUNT.png](TIME_TO_CLOSE_VS_COUNT.png)
![Time_to_close_Vs_Amount_paid](Time_to_close_Vs_Amount_paid.png)
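
One way to put a number on "as the count increases, the average time also increases" is a Pearson correlation across the locations. The sketch below is illustrative only: the function `pearson_r` is written out by hand, and the two input lists are made-up values, not the figures from the charts. With only four locations, any such coefficient is weak evidence and should be read as a hint rather than a conclusion.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-location values: claim counts and average days to close.
claim_counts = [10, 20, 30, 40]
avg_days = [500, 650, 700, 1000]
r = pearson_r(claim_counts, avg_days)
```

A value of `r` close to 1 would be consistent with the pattern described above, while a value near 0 would not.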


## ✅ Summary
- "Sao Luis" received the most claims and "Natal" the fewest.
- The distribution of time to close claims is not uniform. The largest number of claims closed within 400-600 days, and very few (around 4) took longer than 2400 days.
- 50 percent of all claims are closed within 638 days and 75 percent within 1143 days.
- Looking at each location separately, "Sao Luis" takes the most days to close both 50 percent and 75 percent of its claims compared to the other locations.
- The average time to close claims is highest in Sao Luis.
- It seems that as the count of claims increases, the Avg. time to close also increases. However, we are unable to deduce a relationship between Avg. time to close and the other measures (Avg. Amount paid, Avg. Individuals on claim).


## Suggestions
- The analysis suggests that the high time to close claims in "Sao Luis" might be due to its having the highest claim count, so this location requires more focus.
- More data about "Sao Luis" is needed to analyse whether other factors, such as the number of employees or the details of the claim-handling process, affect the time to close claims.