Can't get Autopilot to train correctly #31

Closed
chilipeppr opened this issue Sep 14, 2020 · 85 comments

Labels
documentation (Improvements or additions to documentation), good first issue (Good for newcomers)

@chilipeppr

Thanks again for the great work on this project.

I've spent a couple days now trying to get the Autopilot to train and nothing has quite worked for me. All I get when I turn the Network on after training/post-processing/recompiling the Android app is the OpenBot driving in a slow straight line and crashing into the wall.

Here's what I've gone through thus far...

  1. To train, I use the default Data Logger of crop_img. I have the Model set to AUTOPILOT_F and I set the Device to GPU. I leave Drive Mode set to Controller and then I turn on Logging from the XBox controller by hitting the A button. I hear the MP3 file say "Logging started" and then I start driving around my kitchen.

WIN_20200913_18_17_07_Pro

  2. Once I've created about 5 minutes' worth of data from driving around, I turn off Logging by hitting A again on the Xbox controller. I hear the MP3 file say "Logging stopped". This part seems fine.

  3. I download the Zip file of the logging and place it in the policy folder. I'm showing the hierarchy here because your docs say to create a folder called "train" but the Python script looks for "train_data". I also initially didn't realize you have to manually create folders for each set of log data; I now have that correct, so I get through the Jupyter notebook process fine instead of failing at Step 10, which is what happens if the folder structure is wrong.
    image

  4. My images seem to be fine. The resolution is small at 256x96 but I presume that's the correct size for the crop_img default setting.

image
image

  5. My sensor_data seems ok too.

image

The ctrlLog.txt seems ok (after I fixed that int problem that I posted earlier as a FIXED issue.)
image

My indicatorLog.txt always looks like this. I suppose this could be a problem, as it's not clear to me what indicatorLog.txt is even for. I realize hitting X, Y, or B sets the vehicleIndicator to -1, 0, or 1, but I don't really understand why.

image

I realize the indicatorLog.txt gets merged with ctrlLog.txt and rgbFrames.txt into the following combined file, but all seems good assuming a "cmd" of 1 from indicatorLog.txt is the value I want for the post-processing.
image

  6. In the Jupyter notebook everything seems to run correctly. It opens my manually created folders correctly after I modified the Python code to point at them. It reads in my sample data. It removes my frames where the motors were at 0.

image

I get the correct amount of training frames and test frames.

image

In this part I'm confused by the "Clipping input data" errors and by what Cmd means. It seems to relate to indicatorLog.txt, but I'm not sure what a -1, 0, or 1 would indicate in the caption above the images. My guess is that the Label is the pair of motor values the network should produce for each image, but I'm not sure, since every image shows the same motor value of 0.23.

image

In Step 31 of the Jupyter notebook the output seems fine.

image

In Step 33 the epochs all seem to have run correctly. They took quite a while to finish.

image

And in Steps 34 through 37 the graphs seem reasonable, but I'm not really sure what to expect here...

image

image

In Step 41 this seems to be OK, but it makes me think Pred means "prediction", i.e. the motor values. I'm still not sure what Cmd and Label are then.
image

  7. Once the best.tflite file is generated and placed into the "checkpoints" folder...

image

I then copy it to the "networks" folder for the Android app, rename it to "autopilot_float.tflite" and recompile the Android app.

image

Here is Android Studio recompiling.

image

That's about all I can think of to describe what I'm doing to try to get the training going. I would really love to get this working. Your help is greatly appreciated.

Thanks,
John

@thias15
Collaborator

thias15 commented Sep 14, 2020

Hi John.

Thank you very much for your detailed issue, I really appreciate it! This makes it much easier to help. First the good news: your procedure is correct. Now let me clarify a few things.

  1. Cmd: This corresponds to a high-level command such as "turn left/right" or "go straight" at the next intersection. It is encoded as -1: left, 0: straight, 1: right. As you pointed out, this command can be controlled with the X, Y, or B buttons on the game controller. If you have LEDs connected, it will also control the left/right indicator signals of the car. These commands are logged in the indicatorLog.txt file. During training, the network is conditioned on these commands: if you approach an intersection where the car could go left, straight, or right, it is not clear what it should do from the image alone, and the command resolves that ambiguity (see the sketch after this list). It seems that you just want the car to drive along a path in your house, so I would recommend just keeping this cmd at 0. NOTE: This command should be the same when you test the policy on the car.

  2. Label, Pred: These are the control signals of the car, mapped from -255,255 to -1,1. The label is obtained by logging the control that was used to drive the car. The prediction is what the network predicts to be the correct value given an image.

  3. Clipping for image display: this is due to the data augmentation which results in some image values outside the valid range. You can just ignore this.
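
For intuition only, here is a minimal sketch of the conditioning idea from point 1; the function and variable names are illustrative and not taken from the OpenBot code.

# Minimal sketch (illustrative, not OpenBot's implementation): a command-
# conditioned policy can be viewed as one prediction head per command,
# with cmd in {-1, 0, 1} selecting which head drives the motors.
def select_control(head_outputs, cmd):
    # head_outputs: dict mapping cmd -> (left, right) controls in [-1, 1]
    return head_outputs[cmd]

# Example: at an intersection the image alone is ambiguous, so
# select_control(heads, 1) picks the "turn right" behaviour, while
# select_control(heads, 0) keeps the robot going straight.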

Now a few comments that will hopefully help you to get it to work.

  1. The same motor value of 0.23 is a problem. This should not happen. Please try to delete the files in the sensor_data folder that were generated ("matched_..."). When you run the Jupyter notebook again, they will be regenerated.
  2. In general the label values seem very low. We used the "Fast" mode for data collection; I would recommend doing the same. Note that in lines 43-45 of the dataloader.py file, values are normalized into the range -1,1.
    def get_label(self, file_path):
        index = self.index_table.lookup(file_path)
        return self.cmd_values[index], self.label_values[index]/255

For the "Normal" mode, the maximum is capped at 192. For the "Slow" mode at 128.

  3. Depending on the difficulty of the task, you may have to collect significantly more data. Could you describe your data collection process and the driving task in a bit more detail? Also, you may need to train for more epochs.
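
To make point 2 concrete, here is the quick arithmetic behind "low" labels, using the caps quoted above and the /255 normalization from dataloader.py:

# Largest possible normalized label per drive mode, given label/255:
#   Fast:   255 / 255 = 1.00
#   Normal: 192 / 255 ≈ 0.75
#   Slow:   128 / 255 ≈ 0.50
# So labels from data collected in Normal or Slow mode never reach 1.0.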

Hope this helps. Please keep me updated.

@chilipeppr
Author

chilipeppr commented Sep 14, 2020 via email

@chilipeppr
Author

Ok, here's a video of how I train. I used Fast mode (vs Normal or Slow). I set to AUTOPILOT_F and used NNAPI.

https://photos.app.goo.gl/o6BtAHunDjtj8fMNA

And then here's a video of playing back that training. It still doesn't quite work, but I do seem to be getting more movement in the robot with training in Fast mode vs Normal.

https://photos.app.goo.gl/kCw4DpRN6vPpbtCcA

@parixit

parixit commented Sep 14, 2020

@chilipeppr super helpful video! It would be great if you could do a step-by-step video of your build for complete newbies.

@chilipeppr
Author

chilipeppr commented Sep 14, 2020 via email

@parixit

parixit commented Sep 14, 2020

Agreed! This project is daunting, but I want to do it together with my kids. I'm waiting on the parts; I had our local library 3D print them (even they were interested in the project). I'll look forward to your videos, keep us posted!

@chilipeppr
Author

Is it possible that with my kitchen island I have to train each turn around the island as a right turn? Meaning turn on Cmd = 0 on the straight parts and then turn on Cmd = 1 as I turn right 4 times?

@thias15
Collaborator

thias15 commented Sep 14, 2020

@chilipeppr If you would like to contribute build videos, that would be awesome and we would be very happy to include them in the README for others! I realize that a lot of people need much more detailed instructions. We are working to provide more comprehensive documentation, but at the moment I have a lot of other engagements as well. As for the time-lapse video, I did record a video of a complete build, but have not gotten a chance to edit it yet. If you like, I'd be happy to set up a quick call with you to coordinate.

@thias15
Collaborator

thias15 commented Sep 14, 2020

The predicted control values still seem to be too low. Could you post the figures at the end of training? I'm afraid the model did not converge properly or it overfit. The training and validation loss should both decrease, and the direction and angle metrics should both increase.
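
As a hedged aside for readers wondering what metrics with these names might compute (an assumption based on the names, not copied from the repo's metrics code), something along these lines is typical:

import tensorflow as tf

# Hypothetical sketch: "direction" agreement between predicted and logged
# controls, where y_true/y_pred hold (left, right) motor values in [-1, 1].
def direction_metric(y_true, y_pred):
    steer_true = y_true[:, 1] - y_true[:, 0]  # right minus left = turning tendency
    steer_pred = y_pred[:, 1] - y_pred[:, 0]
    agree = tf.equal(tf.sign(steer_true), tf.sign(steer_pred))
    return tf.reduce_mean(tf.cast(agree, tf.float32))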

The task of your choice should be learnable and keeping the indicator command at 0 should be fine since you are driving along a fixed path. However, I suspect that you need to train the model for more epochs and that you need more training data. I would recommend to:

  1. Collect maybe 10 datasets with 5 loops each driving around the kitchen block. Start/stop logging should ideally be done while driving along the trajectory. In the video you have recorded, you drive somewhere else at the end before the logging is stopped. This could lead to difficulty during training, especially if there is not a lot of data.
  2. Take 8 of these datasets for training and the remaining two for validation. By monitoring the validation metrics, you should get a good idea of when the model is starting to work. (A rough sketch of such a split follows below.)
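
As a rough illustration of the split in point 2 (folder names are hypothetical, not the repo's layout):

import glob
import random

# Hypothetical sketch: split 10 recorded dataset folders into 8 for training
# and 2 for validation.
datasets = sorted(glob.glob("dataset/kitchen_loops/*"))
random.seed(0)
random.shuffle(datasets)
train_datasets, val_datasets = datasets[:8], datasets[8:]
print(len(train_datasets), "training /", len(val_datasets), "validation")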

Collecting good/clean data is key to machine learning. I know it is not a lot of fun to collect such data, but it is what makes it work in the end! Keep up the great work. Looking forward to your next update (hopefully with the robot driving autonomously).

@chilipeppr
Author

Ok, I retrained with 10 datasets -- 8 for the training and 2 for the testing. Each run was 5 to 7 loops around the kitchen island. I turned the noise on for 3 of the dataset runs as well.

Here's a video of how I did the training. It's similar to my first post, but I started logging while in motion. I kept the Cmd=0 (default).
https://www.youtube.com/watch?v=W7EHo0Jk02A

On the phone these are the zip files that I copied and extracted to the train_data and test_data folders. Notice they're all around 40MB to 80MB in size which feels correct from a size per training session. Again, I used crop_img.
image

Here are the 8 training datasets placed into the policy/dataset folder.
image

Here are the 2 test datasets.
image

I also ran it at Normal speed, but I changed the divisor in dataloader.py from the default 255 to 192, since the default assumes Fast mode.
image

I also started/stopped logging by hitting the A button on the Xbox controller while the robot was in motion, so no speeds of 0 would be logged. You can see that for the 10 datasets almost no frames were removed for speed 0. I'm actually surprised I ended up with any speed-0 frames at all, because I don't recall stopping, so that's a bit of a concern.

image

I ended up with the largest number of frames I've ever trained with.

image

I ended up with much higher numbers in the Label here than the 0.23 numbers you were worried about in my original post.

image

Here is the model.fit output. I'd love to understand what the loss, direction_metric, and angle_metric mean, to know whether this output seems reasonable or not.

image
image

Here is the Evaluation data.

image
image

I'm a little worried about these warnings, but maybe they're ok to ignore.
image

And then here's the final output with predictions. The motor values in the predictions sure seem better.

image

However, when I go to run the Autopilot with this new model, it still seems to fail. The only progress is that I now have motor movement; before, the motor values were so low that I had no movement. Here's a video of the autopilot running and the robot not staying on the path but just running into chairs.

https://www.youtube.com/watch?v=a0-0lh7_j0E

@chilipeppr
Author

Hmm. Do I need to edit any of these variables? My crop_img images are 256x96.

image

@chilipeppr
Author

Well, apparently the crop_imgs must be correct because I tried doing a training session with preview_img and when I went to train I got these errors.

image

@thias15
Collaborator

thias15 commented Sep 15, 2020

  1. crop_imgs is correct
  2. Can you try to change the batch size to a value which is as high as possible? We trained with a batch size of 128, but this will most likely require a GPU. If you cannot, you need to decrease the learning rate. From the plots it looks like it is too high for the dataset.
  3. I'm not sure if rescaling all the values by 192 would work since the car was never actually driven with the resulting values. Did you mix the "Normal" and "Fast" datasets?
  4. In line 29, the fact that the label for all images is the same (0.88,0.73) is definitely problematic as well. (The reverse label is generated by FLIP_AUG; see the sketch after this list.) For your task of going around the kitchen block in one direction you should probably set FLIP_AUG to False!
  5. If you like you can upload your dataset somewhere and I'll have a look. This would be much quicker to debug.
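
To illustrate point 4, here is a hedged sketch of what a left/right flip augmentation typically does (illustrative only, not the repo's exact implementation), and why it clashes with a loop driven in only one direction:

import tensorflow as tf

# Illustrative sketch: mirror the image and swap the (left, right) controls.
# For a course that is only ever driven in one direction, the flipped samples
# teach the opposite turn for the same-looking scene, hence FLIP_AUG = False.
def flip_sample(image, label):
    flipped_image = tf.image.flip_left_right(image)
    flipped_label = (label[1], label[0])  # swap left/right motor controls
    return flipped_image, flipped_label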

@chilipeppr
Author

Here is a link to download my dataset. It's the 10 sessions I ran yesterday based on your initial feedback. 8 of the sessions are in train_data as a Zip file. 2 of the sessions are in test_data as a Zip file.

https://drive.google.com/drive/folders/18MchBUtods4sRerSpaA6eTrtC9DPvpbd?usp=sharing

I just tried training the dataset again with your feedback above:

  1. I changed the batch to 128. I have an Nvidia Geforce GTX on my Surface Book 3 so no problem on the GPU needed for that change.
  2. All of my training was done at the Normal speed, so the 192 divider should be ok. There is no Fast in this dataset.
  3. I turned off FLIP_AUG.

image

The results still didn't do anything for me. The robot still acts the same way. I did train for 20 epochs this time.

image
image

The "best fit" was epoch 2 so that was a lot of wasted CPU/GPU going to 20 epochs.

image

@thias15
Collaborator

thias15 commented Sep 15, 2020

I will download the data and investigate. The fact that it reaches perfect validation metrics after two epochs and then completely fails is very strange. Did you also try to deploy the last.tflite model or run it on some test images to see if the predictions make sense?

@thias15
Collaborator

thias15 commented Sep 15, 2020

When I visualize your data, I do see variation in the labels as expected. Do you still see all the same labels?
Screenshot 2020-09-15 at 19 29 46

@chilipeppr
Author

Yeah, in my training my labels are still all the same. So this does seem messed up.

image

@chilipeppr
Author

On your question "Did you also try to deploy the last.tflite model" I did and it was the same failure. It just kept showing a motor value around 0.75 on both left and right motors, sometimes jumping to 0.8 and it would just drive right into chairs/walls.

@thias15
Collaborator

thias15 commented Sep 15, 2020

This is definitely a problem. In the best case the network will just learn this constant label. Did you make any changes to the code? I'm using the exact code from the GitHub repo with no changes (except FLIP_AUG = false in cell 21). In case you made changes, could you stash them or clone a fresh copy of the repo? Then put the same data you uploaded into the corresponding folders and see if you can reproduce what I showed in the comment above.

@chilipeppr
Author

I haven't changed any of the code. I did try that last run with the batch size changed and FLIP_AUG = false. I also tried epoch=20. I did change dataloader.py to divide by 192. Other than that the code is totally the same. I can try to re-check out the repo, but I don't think that's going to change much.

One thing I'm trying right now is to create a new conda environment with tensorflow instead of tensorflow-gpu as the library.

@chilipeppr
Author

Why do I get clipping errors and you don't for utils.show_train_batch?

@thias15
Collaborator

thias15 commented Sep 15, 2020

I also get the clipping warnings; I just scrolled down so that more images with labels are visible. I'm currently running tensorflow on the CPU of my laptop without a GPU. It will take some time, but it should not make any difference. For the paper, all experiments were performed on a workstation with a GPU. One difference is that I only used Mac and Linux. Maybe there is a problem on Windows with the way the labels are looked up? From the screenshots it seems you're on Windows.
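
For what it's worth, a minimal sketch of why the warning is harmless and how it could be silenced when displaying augmented images (assuming float images scaled to [0, 1]; this is not a required change):

import numpy as np
import matplotlib.pyplot as plt

# Augmentation (brightness/contrast jitter, etc.) can push pixel values
# slightly outside [0, 1]; imshow clips them and prints the warning.
def show_image(img):
    plt.imshow(np.clip(img, 0.0, 1.0))  # clip explicitly to avoid the warning
    plt.axis("off")
    plt.show()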

@thias15
Collaborator

thias15 commented Sep 15, 2020

One thing you could try is running everything in the Linux subsystem of Windows.

@chilipeppr
Author

Yes, I'm on Windows. Surface Book 3 with Nvidia GPU.

@thias15
Collaborator

thias15 commented Sep 15, 2020

I'll update you in about 30-60 minutes regarding training progress. But it seems that your issue is the label mapping. I suspect at this point it is related to Windows. As I mentioned, you could try to run the code in the Linux subsystem in Windows. I will also see if I can run it in a VM or setup a Windows environment for testing.
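
Purely as an illustration of the kind of Windows-specific mismatch being suspected here (the actual cause was not confirmed at this point in the thread), a lookup table keyed on "/"-separated paths will silently miss "\"-separated ones:

# Hypothetical sketch: normalize path separators before using file paths as
# lookup keys, so Windows ("\") and Unix ("/") paths hit the same entry.
def normalize_path_key(file_path):
    return file_path.replace("\\", "/")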

@chilipeppr
Author

I'm wondering, if you get a final best.tflite file out of your run, whether you could send it to me to try out on the robot.

I hear you on the label mapping. Could this possibly be something as dumb as Windows doing CR/LF and Mac/Linux using LF?

@thias15
Collaborator

thias15 commented Sep 15, 2020

Hello. It has now finished training for 10 epochs. The plots look reasonable, so why don't you give it a try? Achieving good performance usually takes some hyperparameter tuning, more data, and more training time, but let's see.
best.tflite.zip

@thias15
Collaborator

thias15 commented Sep 15, 2020

notebook.html.zip
This is the complete output of my Jupyter notebook, to give you some idea of what the output should look like. When I get a chance, I will explore the issue you encounter in a Windows environment. It could be something like CR/LF vs LF, but since the code relies on the os library, these types of things should be taken care of. I don't know, but I will let you know what I discover. Thanks for your patience. I really want you to be able to train your own models and will try my best to figure out the issue you are encountering.

@thias15
Collaborator

thias15 commented Sep 15, 2020

Note that both files need to be unpacked. I had to zip them in order to upload them here.

@chilipeppr
Author

I just tried running your best.tflite and it does not work any better. The robot still runs into walls.

@chilipeppr
Author

Yes, I did restart the kernel. I always hit
image

I just tried deleting the matched*.txt files and it doesn't seem to fix anything. The labeled_ds seems to just keep repeating the same labels. It still seems that process_train_path doesn't work, or maybe the labels are already wrong before that step?

@chilipeppr
Author

Finally! Those last code changes got the labels to load correctly!

image

@chilipeppr
Author

Ok, now in your earlier specs, is bz the batch size?

image

And then where do I specify lr?

@thias15
Collaborator

thias15 commented Sep 17, 2020

Cell 34: LR = 0.0001

@thias15
Collaborator

thias15 commented Sep 17, 2020

Yes, BZ is the batch size.
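
For reference, a short recap of the knobs discussed here, using the names from the thread (the variable name EPOCHS and the epoch count are assumptions drawn from the runs above, not recommendations):

BZ = 128      # batch size (training section)
LR = 0.0001   # learning rate (cell 34, next to the optimizer)
EPOCHS = 10   # number of training epochs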

@chilipeppr
Author

Ahh. Ok. I figured all tweakable values were in the same section. I see where LR is now.

@thias15
Collaborator

thias15 commented Sep 17, 2020

Good point, I will refactor it. The reason it is there now is because it is directly related to the optimizer.

@thias15 thias15 added this to In progress in OpenBot Sep 17, 2020
@thias15 thias15 moved this from In progress to Done in OpenBot Sep 17, 2020
@thias15
Collaborator

thias15 commented Sep 17, 2020

I have just pushed the changes and everything should work now on Windows as well. So feel free to pull if you want a clean copy. I have also added a little more info to the notebook and improved the dataset construction. It is probably two orders of magnitude faster now. This will make your life easier when training with larger datasets.

@thias15 thias15 closed this as completed Sep 17, 2020
@chilipeppr
Author

Awesome. I checked out the new changes and will run them.

Question: does flipping the phone over mess up the data collection? I've had the USB port on the right side, but I found I get a higher-facing view in the images if I flip the USB port onto the left. If anything, that may hurt the consistency of the training, since the phone ends up with a slightly different upward tilt and thus the horizon sits lower in the image.

I do know that for running the person-detection AI you need the phone flipped the correct way: initially I had it flipped the wrong way and the robot kept driving away from the person in frame. Once I accidentally flipped the phone the other way, it started working, and I was surprised to realize the mistake. Worth putting into the docs.

@thias15
Collaborator

thias15 commented Sep 17, 2020

The image will be rotated automatically, so it should not affect data collection. As long as the phone is mounted horizontally, it should not make a difference. If it is mounted vertically, the problem will be the limited horizontal field of view and the image cropping. For the person following, it actually works in both horizontal and vertical orientation. I just tested the "opposite" horizontal orientation (180 degrees) and observed the behaviour you described. This seems to be a bug and I did not notice it before. It is probably related to the logic that detects the phone orientation and adapts the computation of the motor controls. I will look into it and fix it.

@thias15
Collaborator

thias15 commented Sep 17, 2020

By the way, I have also noticed that if you train the pilot_net network on your dataset, it seems to achieve much better performance. It is the default in the new notebook. It has a much bigger capacity but still runs in real time. However, due to the larger network you may run into memory issues during training, depending on the specs of your machine. If it does not work, just change the model back to cil_mobile: in the first cell of the training section, change
model = models.pilot_net(NETWORK_IMG_WIDTH,NETWORK_IMG_HEIGHT,BN)
to
model = models.cil_mobile(NETWORK_IMG_WIDTH,NETWORK_IMG_HEIGHT,BN)
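
As an aside (an assumption on my part, not part of the instructions above): if pilot_net only fails because of memory, lowering the batch size may be enough before switching models, e.g.

BZ = 64   # smaller batch to fit the larger pilot_net in memory
model = models.pilot_net(NETWORK_IMG_WIDTH,NETWORK_IMG_HEIGHT,BN)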

@chilipeppr
Author

Oh, interesting. On my Surface Book 3 I have 32GB of RAM, so hopefully I'm in good shape for running pilot_net. Maybe that should be a configuration option at the top of the notebook too, with some explanation of the difference between the models; I would not have realized this without your comment. I'll try to run my data against it right now. I actually just collected a bunch more data with noise turned on and with objects placed in the path to make the data even more interesting.

@thias15
Collaborator

thias15 commented Sep 17, 2020

Cool, let me know how it goes.

@chilipeppr
Author

chilipeppr commented Sep 17, 2020

Hmm. With the changes you made to these lines, the notebook now pulls in data from datasets other than the ones I specify at the top. Not sure you meant to do that. I noticed because the debug output below it was showing other folders, and then I got an error saying my images were different sizes, which only happened because some of my older datasets were recorded at the "preview" size rather than the "crop" size. I fixed it by moving my older datasets out of train_data and test_data, but it is a change in behavior.

image

@thias15
Collaborator

thias15 commented Sep 17, 2020

Yes, sorry, the assumption here is that you want to train on all data in train_data_dir; the individual datasets you set will be ignored. I changed this because it is much faster. I'll see if I can come up with a better solution. In the meantime, I guess just revert to the old way.

@chilipeppr
Author

It is a lot faster, so I'm enjoying the change.

@chilipeppr
Author

I'm running the epochs with your latest code right now. I'm on epoch 5 out of 10. Here's my CPU/RAM/GPU usage. It is using a lot of RAM and CPU, but it does not appear to be using my GPU. Any ideas? I did install tensorflow-gpu as my python library and I'm on version 2.3.0.

image

Here's the Nvidia GPU stats in Task Manager. Zero usage.

image
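
A quick, hedged way to check from inside the notebook whether this TensorFlow build can see the GPU at all (an empty list usually points to a CUDA/cuDNN mismatch with the installed TensorFlow version rather than a problem with the notebook itself):

import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))  # expect one entry for the GeForce GTX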

@chilipeppr
Author

OK, I fixed it by moving back to the conda environment, which only has 2.1.0 as the latest version of tensorflow-gpu, instead of using my direct Python install with the pip-installed tensorflow-gpu. Here's my GPU usage now under the conda environment.

image

@chilipeppr
Author

https://www.youtube.com/watch?v=q0yYN-Ohqwc

Here is a video of my latest tflite build. I give it a score of 70%.

I realize you closed this issue, but this is the latest run, with about 100,000 training images and 10,000 test images of just a simple circle around my kitchen. My goal is to get it to 99%, so I figure I'll train with 200,000 images and 20,000 test images in the hope that that gets me to a reasonable spot.

@thias15
Collaborator

thias15 commented Sep 18, 2020

Yes, that makes sense. Feel free to reopen if you feel it is not solved. I closed it because the original issue was solved (getting it to train correctly), which was related to the Windows OS. You are raising other issues now (e.g. conda version, final task performance, etc.) which are very interesting, and I'm happy to help. However, I would prefer to have a separate issue with a descriptive title for each. That way it can help others with similar questions later on.

@thias15 thias15 added documentation Improvements or additions to documentation good first issue Good for newcomers labels Sep 18, 2020
@thias15
Collaborator

thias15 commented Sep 18, 2020

> Hmm. With the changes you made to these lines, the notebook now pulls in data from datasets other than the ones I specify at the top. Not sure you meant to do that. I noticed because the debug output below it was showing other folders, and then I got an error saying my images were different sizes, which only happened because some of my older datasets were recorded at the "preview" size rather than the "crop" size. I fixed it by moving my older datasets out of train_data and test_data, but it is a change in behavior.
>
> image

This is fixed now. The default is to use all datasets, but you can specify individual datasets as well.

@Pascal66

You didn't look in the right place: it's not the 3D tab, but the CUDA tab.

> Here's the Nvidia GPU stats in Task Manager. Zero usage.
>
> image
