Can't get Autopilot to train correctly #31
Comments
Hi John. Thank you very much for your detailed issue, I really appreciate it! This makes it much easier to help. First the good news: your procedure is correct. Now let me clarify a few things.
1. Cmd: This corresponds to a high-level command such as "turn left/right" or "go straight" at the next intersection. It is encoded as -1: left, 0: straight, 1: right. As you pointed out, this command can be controlled with the X, Y, or B buttons on the game controller. If you have LEDs connected, it will also control the left/right indicator signals of the car. These commands are logged in the indicatorLog.txt file. During training, the network is conditioned on these commands: if you approach an intersection where the car could go left, straight, or right, it is not clear what it should do based on the image alone, and this is where the commands come in to clear up the ambiguity. It seems that you just want the car to drive along a path in your house; in that case, I would recommend just keeping this cmd at 0. NOTE: This command should be the same when you test the policy on the car.
2. Label, Pred: These are the control signals of the car, mapped from -255,255 to -1,1. The label is obtained by logging the control that was used to drive the car. The prediction is what the network predicts to be the correct value given an image.
3. Clipping for image display: this is due to the data augmentation, which results in some image values outside the valid range. You can just ignore this.
Now a few comments that will hopefully help you to get it to work.
1. The same motor value of 0.23 is a problem. This should not happen. Please try to delete the generated files ("matched_...") in the sensor_data folder.
2. In general the label values seem very low. We have used the "Fast" mode for data collection and I would recommend doing the same. If you use the "Normal" or "Slow" modes, you will need to make a change in lines 43-45 of the dataloader.py file:
def get_label(self, file_path):
    index = self.index_table.lookup(file_path)
    return self.cmd_values[index], self.label_values[index]/255
For the "Normal" mode, replace the 255 by 192 (the maximum is capped at 192); for the "Slow" mode, replace it by 128 (the sketch right after this comment shows one way to make this configurable). I will update the code to be more user-friendly in the future.
3. Depending on the difficulty of the task, you may have to collect significantly more data. Could you describe your data collection process and the driving task in a bit more detail? Also, you may need to train for more epochs.
Hope this helps. Please keep me updated. |
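For reference, a minimal sketch of what a mode-aware version of that function could look like; CTRL_MAX and the mode argument are hypothetical names for illustration and are not part of the repo:
CTRL_MAX = {"fast": 255, "normal": 192, "slow": 128}  # maximum motor value per control mode

def get_label(self, file_path, mode="fast"):
    # look up the row for this image and normalize the logged controls to [-1, 1]
    index = self.index_table.lookup(file_path)
    return self.cmd_values[index], self.label_values[index] / CTRL_MAX[mode]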
That is super helpful. I think my earlier training might have been closer to correct, where I just left the Cmd at 0, but I did train in Normal mode, so all of the speeds being played back were really low. They did seem to move higher or lower as I manually moved the camera around to follow the path, but the values never quite reached high enough to get the motors moving. I would say they lingered in the 0.1 range and maybe got to 0.2 as I moved the camera around. I even wrote code to amplify the speeds later, but that didn't quite work. I think I'll try to just record in Fast and/or make those code changes in dataloader.py.
In terms of how I'm training, I'm just steering the car around my kitchen island over and over in a circle, about 10 times, to get a full log to analyze. I figured I'd start simple and at least just get it going in a circle in one direction.
|
Ok, here's a video of how I train. I used Fast mode (vs Normal or Slow). I set it to AUTOPILOT_F and used NNAPI. https://photos.app.goo.gl/o6BtAHunDjtj8fMNA And then here's a video of playing back that training. It still doesn't quite work, but I do seem to be getting more movement in the robot with training in Fast mode vs Normal. |
@chilipeppr super helpful video! It would be great if you could make a step-by-step video of your build for complete newbies. |
I would love to. I figure I might be one of the first to build one of these outside of the Intel team after the public posting of the project as I happened to have every piece needed already sitting in my home workshop, so no need to wait for shipping. It's hard getting this going without more Youtube videos!
|
Agreed! This project is daunting, but I want to do it together with my kids. I'm waiting on the parts, and I had our local library 3D print them (even they were interested in the project). I'll look forward to your videos, keep us posted! |
Is it possible that with my kitchen island I have to train each turn around the island as a right turn? Meaning, keep Cmd = 0 on the straight parts and then switch to Cmd = 1 as I turn right 4 times? |
@chilipeppr If you would like to contribute build videos, that would be awesome and we would be very happy to include them for others in the README! I realize that a lot of people require much more detailed instructions. We are working to provide more comprehensive documentation, but at the moment I have a lot of other engagements as well. As for the time-lapse video, I did record video of a complete build, but did not get a chance to edit it yet. If you like, I'd be happy to set up a quick call with you to coordinate. |
The predicted control values still seem to be too low. Could you post the figures at the end of training? I'm afraid the model either did not converge properly or overfit. The training and validation loss should both decrease, and the direction and angle metrics should both increase. The task of your choice should be learnable, and keeping the indicator command at 0 should be fine since you are driving along a fixed path. However, I suspect that you need to train the model for more epochs and that you need more training data. I would recommend the following:
Collecting good/clean data is key to machine learning. I know it is not a lot of fun to collect such data, but it is what makes it work in the end! Keep up the great work. Looking forward to your next update (hopefully with the robot driving autonomously). |
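As a rough illustration of what "both decrease" looks like, a minimal sketch for plotting the curves from a standard Keras History object; the history variable is whatever the notebook's model.fit call returns, and matplotlib is assumed to be available:
import matplotlib.pyplot as plt

# history = model.fit(...)  # as returned by the training cell
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="val loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()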
Ok, I retrained with 10 datasets -- 8 for training and 2 for testing. Each run was 5 to 7 loops around the kitchen island. I turned the noise on for 3 of the dataset runs as well. Here's a video of how I did the training. It's similar to my first post, but I started logging while in motion. I kept the Cmd=0 (default). On the phone these are the zip files that I copied and extracted to the train_data and test_data folders. Notice they're all around 40MB to 80MB in size, which feels about right for a single training session. Again, I used crop_img. Here are the 8 training datasets placed into the policy/dataset folder.
I also ran it at Normal speed, but changed the divider in dataloader.py to 192 from the default 255, which assumes Fast mode. I also started/stopped logging by hitting the A button on the Xbox controller while the robot was in motion, so I would not log any speeds of 0. You can see for the 10 datasets I had almost no frames removed for speed 0. I'm even surprised I ended up with any frames of speed 0 in the output, because I don't recall stopping, so that's a bit of a concern. I ended up with the most frames I've ever trained with, and with much higher numbers in the Label here than the 0.23 values you were worried about in my original post.
Here is the model.fit output. I'd love to understand what the loss, direction_metric, and angle_metric mean to know whether this output seems reasonable or not. Here is the Evaluation data. I'm a little worried about these warnings, but maybe they're ok to ignore. And then here's the final output with predictions. The motor values in the predictions sure seem better. However, when I go to run the Autopilot with this new model, it still seems to have failed. The only progress is that I now have motor movement; before, the motor values were so low I had no movement. Here's a video of the autopilot running and the robot not staying on the path but rather just running into chairs. |
Here is a link to download my dataset. It's the 10 sessions I ran yesterday based on your initial feedback. 8 of the sessions are in train_data as a Zip file. 2 of the sessions are in test_data as a Zip file. https://drive.google.com/drive/folders/18MchBUtods4sRerSpaA6eTrtC9DPvpbd?usp=sharing I just tried training the dataset again with your feedback above:
The results still didn't do anything for me. The robot still acts the same way. I did train for 20 epochs this time. The "best fit" was epoch 2 so that was a lot of wasted CPU/GPU going to 20 epochs. |
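One way to avoid burning compute on epochs past the best checkpoint is an early-stopping callback. A minimal sketch with standard tf.keras callbacks; whether and how the notebook already wires these up is an assumption here:
import tensorflow as tf

callbacks = [
    # stop once val_loss has not improved for 5 epochs and keep the best weights
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # also snapshot the best model seen so far
    tf.keras.callbacks.ModelCheckpoint("best.h5", monitor="val_loss", save_best_only=True),
]
# history = model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=callbacks)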
I will download the data and investigate. The fact that it reaches perfect validation metrics after two epochs and then completely fails is very strange. Did you also try to deploy the last.tflite model or run it on some test images to see if the predictions make sense? |
On your question "Did you also try to deploy the last.tflite model" I did and it was the same failure. It just kept showing a motor value around 0.75 on both left and right motors, sometimes jumping to 0.8 and it would just drive right into chairs/walls. |
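For what it's worth, a constant ~0.75 prediction can usually be reproduced offline. A minimal sketch for poking at a .tflite file outside the app, using only the generic TFLite interpreter API; the model's exact input signature (image plus cmd) is an assumption based on the discussion above:
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="last.tflite")
interpreter.allocate_tensors()

# Feed zero-filled dummy tensors of the right shape/dtype for every input,
# then inspect the raw outputs (the left/right motor values).
for detail in interpreter.get_input_details():
    dummy = np.zeros(detail["shape"], dtype=detail["dtype"])
    interpreter.set_tensor(detail["index"], dummy)
interpreter.invoke()
for detail in interpreter.get_output_details():
    print(detail["name"], interpreter.get_tensor(detail["index"]))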
This is definitely a problem. In the best case, the network will just learn this constant label. Did you make any changes to the code? I'm using the exact code from the GitHub repo with no changes (except FLIP_AUG = false in cell 21). In case you made changes, could you stash them or clone a fresh copy of the repo? Then put the same data you uploaded into the corresponding folders and see if you can reproduce what I showed in the comment above. |
I haven't changed any of the code. I did try that last run with the batch size changed and FLIP_AUG = false. I also tried epoch=20. I did change dataloader.py to divide by 192. Other than that the code is totally the same. I can try to re-check out the repo, but I don't think that's going to change much. One thing I'm trying right now is to create a new conda environment with tensorflow instead of tensorflow-gpu as the library. |
Why do I get clipping errors and you don't for utils.show_train_batch? |
I also get the clipping errors; I just scrolled down so more images with labels are visible. I'm currently running tensorflow on CPU on my laptop without a GPU. It will take some time, but it should not make any difference. For the paper, all experiments were performed on a workstation with a GPU. One difference is that I only used Mac and Linux. Maybe there is a problem with the way the labels are looked up on Windows? From the screenshots it seems you're on Windows. |
One thing you could try is running everything in the Linux subsystem of Windows. |
Yes, I'm on Windows. Surface Book 3 with Nvidia GPU. |
I'll update you in about 30-60 minutes regarding training progress. But it seems that your issue is the label mapping. I suspect at this point it is related to Windows. As I mentioned, you could try to run the code in the Linux subsystem of Windows. I will also see if I can run it in a VM or set up a Windows environment for testing. |
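For what it's worth, a Windows/Linux path-separator mismatch in the lookup key would produce exactly this kind of broken label mapping. A minimal sketch of the sort of normalization that guards against it; this is illustrative only, not necessarily the actual fix that ends up in the repo:
import os

def normalize_key(file_path):
    # Build and query the lookup table with forward slashes so that the same
    # key is produced on Windows (backslash paths) and on Linux/Mac.
    return os.path.normpath(file_path).replace(os.sep, "/")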
I'm wondering, if you get a final best.tflite file out of your run if you could send that to me to try out on the robot. I hear you on the label mapping. Could this possibly be something as dumb as Windows doing CR/LF and Mac/Linux using LF? |
Hello. It finished training for 10 epochs now. The plots look reasonable, so why don't you give it a try. To achieve good performance usually some hyperparameter tuning, more data and more training time is needed. But let's see. |
notebook.html.zip |
Note that both files need to be unpacked. I had to zip them in order to upload them here. |
I just tried running your best.tflite and it does not work any better. The robot still runs into walls. |
Cell 34: LR = 0.0001 |
Yes, BZ is the batch size. |
Ahh. Ok. I figured all tweakable values were in the same section. I see where LR is now. |
Good point, I will refactor it. The reason it is there now is because it is directly related to the optimizer. |
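For context, a minimal sketch of how the learning rate ties into the optimizer construction; the use of Adam here is an assumption for illustration, not necessarily what the notebook compiles with:
import tensorflow as tf

LR = 0.0001  # cell 34
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
# model.compile(optimizer=optimizer, loss=..., metrics=[...])  # loss/metrics as defined in the notebook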
I have just pushed the changes and everything should work now on Windows as well. So feel free to pull if you want a clean copy. I have also added a little more info to the notebook and improved the dataset construction. It is probably two orders of magnitude faster now. This will make your life easier when training with larger datasets. |
Awesome. I checked out the new changes and will run them. Question: does flipping the phone over matter, i.e. could it mess up the data collection? I've had the USB port on the right side, but found I get a higher-facing trajectory in the images if I flip the USB port onto the left. If anything, that may affect the consistency of the training, in that the phone has a slightly different upward tilt, so the horizon is lower in the image. I do know that for running the person-detect AI you need the phone flipped the correct way, as initially I had it flipped the wrong way and the robot kept driving away from the person in frame. Once I accidentally flipped the phone the other way it started working, and I was surprised to realize the mistake. Worth putting into the docs. |
The image will be rotated automatically, so it should not affect data collection. As long as the phone is mounted horizontally, it should not make a difference. If it is mounted vertically, the problem will be the limited horizontal field of view and the image cropping. For the person following, it actually works in both horizontal and vertical orientation. I just tested the "opposite" horizontal orientation (180 degrees) and observed the behaviour you described. This seems to be a bug and I did not notice it before. It is probably related to the logic that detects the phone orientation and adapts the computation of the motor controls. I will look into it and fix it. |
By the way, I have also noticed that if you train the pilot_net model instead of the default network, it needs considerably more memory. |
Oh, interesting. On my Surface Book 3 I have 32GB of RAM, so hopefully I'm in good shape for running it on pilot_net. Maybe that should be a configuration option at the top of the notebook too, with some explanation of the differences between the models. I would not have realized this without this comment. I'll try to run my data against it right now. I actually just collected a bunch more data with noise turned on and with objects placed in the path to make the data even more interesting. |
Cool, let me know how it goes. |
Hmm. With the changes you made to these lines, you are slurping in data from other datasets than the ones I specify at the top. Not sure you meant to do that. I realized this because the debug output below it was showing other folders, and then I got an error saying my images were different sizes, which I only got because I had tried training some older datasets at the "preview" size rather than the "crop" size. I fixed it by just moving my older datasets outside of train_data and test_data, but that is a difference. |
Yes, sorry, the assumption here is that you want to train on all data in train_data_dir. The individual datasets you set will be ignored. I changed this because it is much faster. I'll see if I can come up with a better solution. I guess in the meantime, just revert to the old way. |
It is a lot faster, so I'm enjoying the change. |
I'm running the epochs with your latest code right now. I'm on epoch 5 out of 10. Here's my CPU/RAM/GPU usage. It is using a lot of RAM and CPU, but it does not appear to be using my GPU. Any ideas? I did install tensorflow-gpu as my Python library and I'm on version 2.3.0. Here are the Nvidia GPU stats in Task Manager. Zero usage. |
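A quick way to confirm whether TensorFlow can see the GPU at all (standard TF 2.x API, available in 2.3):
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))  # an empty list means TF is running CPU-only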
https://www.youtube.com/watch?v=q0yYN-Ohqwc Here is the latest video of my latest tflite build. I give it a score of 70%. I realized you closed this issue, but this is the latest run with about 100,000 images and 10,000 test images of just a simple circle around my kitchen. My goal is to get it to 99% so I figure I'll train it with 200,000 images and with 20,000 test images in hopes that this gets me to a reasonable spot. |
Yes, makes sense. Feel free to reopen if you feel it is not solved. I closed it because the original issue was solved (getting it to train correctly), which was related to the Windows OS. You are raising other issues now (e.g. conda version, final task performance, etc.) which are very interesting, and I'm happy to help. However, I would prefer to have a separate issue with a descriptive title for each. This way it can help others with similar questions later on. |
This is fixed now. The default is to use all datasets, but you can specify individual datasets as well. |
Thanks again for the great work on this project.
I've spent a couple of days now trying to get the Autopilot to train, and nothing has quite worked for me. All I get when I turn the Network on after training/post-processing/recompiling the Android app is the OpenBot driving in a slow straight line and crashing into the wall.
Here's what I've gone through thus far...
Once I've created about 5 minutes' worth of data from driving around, I turn off logging by hitting A again on the Xbox controller. I hear the "Logging stopped" MP3 file play. This part seems fine.
I download the Zip file of the logging and place it in the policy folder. I'm showing the hierarchy here because your docs say to create a folder called "train" but the Python script looks for "train_data". I also initially didn't realize you had to create manual folders for your set of log data; I now have that correct, so I do get through the Jupyter Notebook process fine rather than failing on Step 10, which is what happens if you create your folder structure incorrectly. (See the example layout below.)
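For anyone else following along, the layout that gets me through the notebook looks roughly like this; the dataset folder names are just examples, and the important parts are the train_data/test_data folders plus one manually created folder per set of logs:
policy/
  dataset/
    train_data/
      my_dataset_1/   (extracted log sessions go in here)
    test_data/
      my_dataset_2/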
My images seem to be fine. The resolution is small at 256x96 but I presume that's the correct size for the crop_img default setting.
The ctrlLog.txt seems ok (after I fixed that int problem that I posted earlier as a FIXED issue.)
My indicatorLog.txt always looks like this. I suppose this could possibly be a problem, as it's quite confusing what the indicatorLog.txt is even for. I realize hitting X, Y, or B sets the vehicle indicator to -1, 0, or 1, but it doesn't really make sense to me why.
I realize the indicatorLog.txt gets merged with ctrlLog.txt and rgbFrames.txt into the following combined file, but all seems good assuming a "cmd" of 1 from indicatorLog.txt is the value I want for the post-processing.
I get the correct amount of training frames and test frames.
In this part I am confused by these "Clipping input data" errors and by what Cmd means. It seems to relate to indicatorLog.txt, but I'm not sure what a -1, 0, or 1 would indicate in the caption above the images. My guess on the Label is that those are the motor values that would be generated during a Network run on the OpenBot for each image, but I'm not sure, since each one shows the same motor value of 0.23.
In Step 31 of the Jupyter Notebook the output seems fine.
In Step 33 the epochs all seem to have run correctly. They took quite a while to finish.
And in Steps 34 through 37 the graph seems reasonable, but I'm not really sure what to expect here...
In Step 41 this seems to be ok, but it's making me think Pred means "prediction", which would be the motor values. I'm still not sure what the Cmd and Label are then.
I then copy it to the "networks" folder for the Android app, rename it to "autopilot_float.tflite" and recompile the Android app.
Here is Android Studio recompiling.
That's about all I can think of to describe what I'm doing to try to get the training going. I would really love to get this working. Your help is greatly appreciated.
Thanks,
John