# Categorizing Game

In this task you are given a data set to investigate in order to determine how risky a car is to insure. The type of car that you own affects the price of insurance. Each car is assigned a symbol (-3, -2, -1, 0, 1, 2, 3) that indicates how risky it is. Your job is to use the data set to come up with your best prediction of the symbol for each of the cars described below. You can use data investigation techniques such as graphing and linear regression to arrive at your estimates. Your estimates do not have to be integers, but the correct answers always are.

In [2]:
library(dplyr)
library(ggplot2)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

## Information about the data

1. Title: 1985 Auto Imports Database

2. Source Information:
   -- Creator/Donor: Jeffrey C. Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
   -- Date: 19 May 1987
   -- Sources:
     1) 1985 Model Import Car and Truck Specifications, 1985 Ward's
        Automotive Yearbook.
     2) Personal Auto Manuals, Insurance Services Office, 160 Water
        Street, New York, NY 10038 
     3) Insurance Collision Report, Insurance Institute for Highway
        Safety, Watergate 600, Washington, DC 20037

4. Relevant Information:
   -- Description
      This data set consists of three types of entities: (a) the
      specification of an auto in terms of various characteristics, (b)
      its assigned insurance risk rating, (c) its normalized losses in use
      as compared to other cars.  The second rating corresponds to the
      degree to which the auto is more risky than its price indicates.
      Cars are initially assigned a risk factor symbol associated with its
      price.   Then, if it is more risky (or less), this symbol is
      adjusted by moving it up (or down) the scale.  Actuarians call this
      process "symboling".  A value of +3 indicates that the auto is
      risky, -3 that it is probably pretty safe.

      The third factor is the relative average loss payment per insured
      vehicle year.  This value is normalized for all autos within a
      particular size classification (two-door small, station wagons,
      sports/speciality, etc...), and represents the average loss per car
      per year.


7. Attribute Information:     
     Attribute:                Attribute Range:
     ------------------        -----------------------------------------------
  1. symbol:                   -3, -2, -1, 0, 1, 2, 3.
  2. make:                     alfa-romero, audi, bmw, chevrolet, dodge, honda,
                               isuzu, jaguar, mazda, mercedes-benz, mercury,
                               mitsubishi, nissan, peugot, plymouth, porsche,
                               renault, saab, subaru, toyota, volkswagen, volvo
  3. fueltype:                 diesel, gas.
  4. aspiration:               std, turbo.
  5. numdoors:                 four, two.
  6. bodystyle:                hardtop, wagon, sedan, hatchback, convertible.
  7. drivewheels:              4wd, fwd, rwd.
  8. enginelocation:           front, rear.
  9. wheelbase:                continuous from 86.6 to 120.9.
 10. length:                   continuous from 141.1 to 208.1.
 11. width:                    continuous from 60.3 to 72.3.
 12. height:                   continuous from 47.8 to 59.8.
 13. curbweight:               continuous from 1488 to 4066.
 14. enginetype:               dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
 15. cylinders:                eight, five, four, six, three, twelve, two.
 16. enginesize:               continuous from 61 to 326.
 17. fuelsystem:               1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
 18. bore:                     continuous from 2.54 to 3.94.
 19. stroke:                   continuous from 2.07 to 4.17.
 20. compression-ratio:        continuous from 7 to 23.
 21. horsepower:               continuous from 48 to 288.
 22. peakrpm:                  continuous from 4150 to 6600.
 23. citympg:                  continuous from 13 to 49.
 24. hwympg:                   continuous from 16 to 54.
 25. price:                    continuous from 5118 to 45400.

In [3]:
# Load the car insurance data (the CSV file must be in the working directory)
cars <- read.csv("carinsurance.csv")

In [0]:
# Inspect the column names, their types, and the first few values
str(cars)

## Now take a few minutes and try to find the variables that have the largest impact on the symbol. 
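
One possible starting point for this search is to rank the numeric columns by how strongly they correlate with the symbol. The sketch below is only an illustration: `rank_numeric_predictors` is a hypothetical helper name, and it assumes the target column is called `symbol`, as in the attribute list above. Correlation only captures linear relationships, so a low value does not prove a variable is unimportant.

```r
# Rank numeric columns by the strength of their linear relationship with a
# target column. `rank_numeric_predictors` is a hypothetical helper name, and
# the default target "symbol" is assumed from the attribute list above.
rank_numeric_predictors <- function(df, target = "symbol") {
  stopifnot(target %in% names(df), is.numeric(df[[target]]))

  # Keep only the numeric columns other than the target itself
  is_num <- vapply(df, is.numeric, logical(1))
  predictors <- setdiff(names(df)[is_num], target)

  # Pearson correlation of each predictor with the target,
  # using only the rows where both values are present
  r <- vapply(
    predictors,
    function(col) cor(df[[col]], df[[target]], use = "complete.obs"),
    numeric(1)
  )

  # Strongest relationships (positive or negative) first
  r[order(abs(r), decreasing = TRUE)]
}

# Usage with the data loaded earlier (uncomment once `cars` has been read in):
# rank_numeric_predictors(cars)
```

Categorical columns such as `make` or `numdoors` are skipped by this helper; a boxplot of the symbol for each category is one way to look at those.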

### Round 1

The car in question has the following characteristics:

| Attribute        | Value     |
|------------------|-----------|
| make             | honda     |
| fueltype         | gas       |
| aspiration       | std       |
| numdoors         | two       |
| bodystyle        | hatchback |
| drivewheels      | fwd       |
| enginelocation   | front     |
| wheelbase        | 86.6      |
| length           | 144.6     |
| width            | 63.9      |
| height           | 50.8      |
| curbweight       | 1713      |
| enginetype       | ohc       |
| cylinders        | four      |
| enginesize       | 92        |
| fuelsystem       | 1bbl      |
| bore             | 2.91      |
| stroke           | 3.41      |
| compressionratio | 9.6       |
| horsepower       | 58        |
| peakrpm          | 4800      |
| citympg          | 49        |
| hwympg           | 54        |
| price            | 6479      |

**Make an estimate of the riskiness of this car. The correct answer is an integer from -3 to 3, with -3 being the least risky and 3 the most risky. Show your calculations and explain how you came to your decision.**
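
If you decide to use linear regression, the general pattern is to fit a model on the data set and then ask it for a prediction for the car in question. The sketch below shows that pattern only; `predict_symbol` is a hypothetical helper, the column names are assumed to match the attribute list, and the two predictors in the usage comment are an arbitrary illustration, not a suggestion of which variables matter.

```r
# Fit a linear model on the training data and predict the symbol for one car.
# `predict_symbol` is a hypothetical helper; the formula in the usage comment
# below is only an illustration, not a recommended model.
predict_symbol <- function(train, new_car, model_formula) {
  fit <- lm(model_formula, data = train)
  # predict() returns one (possibly non-integer) estimate per row of new_car
  unname(predict(fit, newdata = new_car))
}

# Usage with the data loaded earlier (uncomment once `cars` has been read in).
# The values are taken from the Round 1 table; column names are assumed.
# new_car <- data.frame(wheelbase = 86.6, height = 50.8)
# predict_symbol(cars, new_car, symbol ~ wheelbase + height)
```

The same helper can be reused in later rounds by changing the values in `new_car` and the formula.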

### Round 2

The car in question has the following characteristics:

| Attribute        | Value  |
|------------------|--------|
| make             | nissan |
| fueltype         | gas    |
| aspiration       | std    |
| numdoors         | four   |
| bodystyle        | sedan  |
| drivewheels      | fwd    |
| enginelocation   | front  |
| wheelbase        | 94.5   |
| length           | 165.3  |
| width            | 63.8   |
| height           | 54.5   |
| curbweight       | 1971   |
| enginetype       | ohc    |
| cylinders        | four   |
| enginesize       | 97     |
| fuelsystem       | 2bbl   |
| bore             | 3.15   |
| stroke           | 3.29   |
| compressionratio | 9.4    |
| horsepower       | 69     |
| peakrpm          | 5200   |
| citympg          | 31     |
| hwympg           | 37     |
| price            | 7499   |

**Make an estimate of the riskiness of this car. The correct answer is an integer from -3 to 3, with -3 being the least risky and 3 the most risky. Show your calculations and explain how you came to your decision.**

### Round 3

The car in question has the following characteristics:

| Attribute        | Value  |
|------------------|--------|
| make             | volvo  |
| fueltype         | diesel |
| aspiration       | turbo  |
| numdoors         | four   |
| bodystyle        | sedan  |
| drivewheels      | rwd    |
| enginelocation   | front  |
| wheelbase        | 109.1  |
| length           | 188.8  |
| width            | 68.9   |
| height           | 55.5   |
| curbweight       | 3217   |
| enginetype       | ohc    |
| cylinders        | six    |
| enginesize       | 145    |
| fuelsystem       | idi    |
| bore             | 3.01   |
| stroke           | 3.4    |
| compressionratio | 23     |
| horsepower       | 106    |
| peakrpm          | 4800   |
| citympg          | 26     |
| hwympg           | 27     |
| price            | 22470  |

**Make an estimate of the riskiness of this car. The correct answer is an integer from -3 to 3, with -3 being the least risky and 3 the most risky. Show your calculations and explain how you came to your decision.**

### Round 4

The car in question has the following characteristics:

| Attribute        | Value |
|------------------|-------|
| make             | dodge |
| fueltype         | gas   |
| aspiration       | std   |
| numdoors         | four  |
| bodystyle        | wagon |
| drivewheels      | fwd   |
| enginelocation   | front |
| wheelbase        | 103.3 |
| length           | 174.6 |
| width            | 64.6  |
| height           | 59.8  |
| curbweight       | 2535  |
| enginetype       | ohc   |
| cylinders        | four  |
| enginesize       | 122   |
| fuelsystem       | 2bbl  |
| bore             | 3.34  |
| stroke           | 3.46  |
| compressionratio | 8.5   |
| horsepower       | 88    |
| peakrpm          | 5000  |
| citympg          | 24    |
| hwympg           | 30    |
| price            | 8921  |

**Make an estimate of the riskiness of this car. The correct answer is an integer from -3 to 3, with -3 being the least risky and 3 the most risky. Show your calculations and explain how you came to your decision.**

### Final Score

To calculate your final score we will use a measure called Root Mean Square Error (often written RMSE). First find your error for each round by taking the difference between your estimate and the true answer. Square each of those errors, calculate the average of the squared errors, and then take the square root of that average. **What is your score?**
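
The steps described above can be written as a short R function. This is a minimal sketch: `rmse` is a helper defined here rather than a base R function, and the vector names in the usage comment are placeholders for your own four estimates and the four true answers.

```r
# Root Mean Square Error between a vector of estimates and the true answers.
# `rmse` is a helper defined here; it is not part of base R.
rmse <- function(estimates, truth) {
  stopifnot(length(estimates) == length(truth))
  errors <- estimates - truth   # error for each round
  sqrt(mean(errors^2))          # square, average, then take the square root
}

# Usage (placeholder names): both arguments are numeric vectors of length 4,
# one entry per round, in the same order.
# rmse(my_estimates, true_symbols)
```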

**What do you think this number represents?**